Extrapolation and Neural Networks
A problem with random forests, like all machine learning or deep learning algorithms, is that they don’t always generalize well to new data. We will see in which situations neural networks generalize better, but first, let’s look at the extrapolation problem that random forests have.
The Extrapolation Problem
In [83]:
#hide
np.random.seed(42)
torch.manual_seed(42)  # also seed torch, since the data below is generated with torch.randn_like
Let’s consider the simple task of making predictions from 40 data points showing a slightly noisy linear relationship:
In [84]:
x_lin = torch.linspace(0,20, steps=40)
y_lin = x_lin + torch.randn_like(x_lin)
plt.scatter(x_lin, y_lin);
Although we only have a single independent variable, sklearn expects a matrix of independent variables, not a single vector. So we have to turn our vector into a matrix with one column. In other words, we have to change the shape from `[40]` to `[40,1]`. One way to do that is with the `unsqueeze` method, which adds a new unit axis to a tensor at the requested dimension:
In [85]:
xs_lin = x_lin.unsqueeze(1)
x_lin.shape,xs_lin.shape
Out[85]:
(torch.Size([40]), torch.Size([40, 1]))
A more flexible approach is to slice an array or tensor with the special value `None`, which introduces an additional unit axis at that location:
In [86]:
x_lin[:,None].shape
Out[86]:
torch.Size([40, 1])
We can now create a random forest for this data. We’ll use only the first 30 rows to train the model:
In [87]:
m_lin = RandomForestRegressor().fit(xs_lin[:30],y_lin[:30])
Then we’ll test the model on the full dataset. The blue dots are the training data, and the red dots are the predictions:
In [88]:
plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);
We have a big problem! Our predictions outside of the domain that our training data covered are all too low. Why do you suppose this is?
Remember, a random forest just averages the predictions of a number of trees. And a tree simply predicts the average value of the rows in a leaf. Therefore, a tree and a random forest can never predict values outside of the range of the training data. This is particularly problematic for data where there is a trend over time, such as inflation, and you wish to make predictions for a future time. Your predictions will be systematically too low.
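To see this concretely, here is a quick sanity check (a small sketch reusing the `m_lin` forest and the tensors created above): no matter what inputs we give it, the forest's prediction can never exceed the largest target it saw during training, because every leaf stores an average of training targets, and the forest averages those leaf values.

```python
# Sketch: a random forest regressor's predictions are capped by its training targets.
train_max = y_lin[:30].max().item()   # largest target among the 30 training rows
preds = m_lin.predict(xs_lin)         # predict across the full range of x values
print(preds.max(), train_max)         # the forest's ceiling is the training maximum
assert preds.max() <= train_max + 1e-6
```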
But the problem extends beyond time variables. Random forests are not able to extrapolate outside of the types of data they have seen, in a more general sense. That’s why we need to make sure our validation set does not contain out-of-domain data.
Finding Out-of-Domain Data
Sometimes it is hard to know whether your test set is distributed in the same way as your training data, or, if it is different, what columns reflect that difference. There’s actually an easy way to figure this out, which is to use a random forest!
But in this case we don’t use the random forest to predict our actual dependent variable. Instead, we try to predict whether a row is in the validation set or the training set. To see this in action, let’s combine our training and validation sets together, create a dependent variable that represents which dataset each row comes from, build a random forest using that data, and get its feature importance:
In [89]:
df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))
m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]
Out[89]:
| | cols | imp |
|---|---|---|
| 6 | saleElapsed | 0.891571 |
| 9 | SalesID | 0.091174 |
| 14 | MachineID | 0.012950 |
| 0 | YearMade | 0.001520 |
| 10 | Enclosure | 0.000430 |
| 5 | ModelID | 0.000395 |
This shows that there are three columns that differ significantly between the training and validation sets: `saleElapsed`, `SalesID`, and `MachineID`. It's fairly obvious why this is the case for `saleElapsed`: it's the number of days between the start of the dataset and each row, so it directly encodes the date. The difference in `SalesID` suggests that identifiers for auction sales might increment over time. And `MachineID` suggests something similar might be happening for individual items sold in those auctions.
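A quick way to confirm this reading is to compare the two sets directly (a small check reusing the frames from above): if these columns really encode time, the validation rows, which come from later sales, should have noticeably larger values on average.

```python
# Compare training vs. validation averages for the three suspect columns;
# identifiers that grow over time show up as a clear gap between the two.
for c in ('saleElapsed', 'SalesID', 'MachineID'):
    print(c, xs_final[c].mean(), valid_xs_final[c].mean())
```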
Let’s get a baseline of the original random forest model’s RMSE, then see what the effect is of removing each of these columns in turn:
In [90]:
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))
for c in ('SalesID','saleElapsed','MachineID'):
m = rf(xs_final.drop(c,axis=1), y)
print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))
orig 0.232883
SalesID 0.230347
saleElapsed 0.235529
MachineID 0.230735
It looks like we should be able to remove `SalesID` and `MachineID` without losing any accuracy. Let's check:
In [91]:
time_vars = ['SalesID','MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)
m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)
Out[91]:
0.229498
Removing these variables has slightly improved the model's accuracy; but more importantly, it should make it more resilient over time, and easier to maintain and understand. We recommend that for all datasets you try building a model where your dependent variable is `is_valid`, like we did here. It can often uncover subtle domain shift issues that you may otherwise miss.
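If you want to reuse this trick on other datasets, it's easy to package as a small helper. Here is a sketch using plain sklearn (`domain_shift_report` is a hypothetical name, and the hyperparameters are just reasonable defaults):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def domain_shift_report(train_xs, valid_xs, n_top=6):
    "Rank columns by how well a random forest can use them to tell training rows from validation rows."
    df = pd.concat([train_xs, valid_xs])
    is_valid = np.array([0]*len(train_xs) + [1]*len(valid_xs))
    m = RandomForestClassifier(n_estimators=40, min_samples_leaf=15, n_jobs=-1)
    m.fit(df, is_valid)
    fi = pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_})
    return fi.sort_values('imp', ascending=False).head(n_top)

# e.g. domain_shift_report(xs_final, valid_xs_final)
```

Columns near the top of that report are the ones most likely to cause out-of-domain surprises, so they are the first candidates to drop or transform.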
One thing that might help in our case is to simply avoid using old data. Often, old data shows relationships that just aren’t valid any more. Let’s try just using the most recent few years of the data:
In [92]:
xs['saleYear'].hist();
Here’s the result of training on this subset:
In [93]:
filt = xs['saleYear']>2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]
In [94]:
m = rf(xs_filt, y_filt)
m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)
Out[94]:
(0.177284, 0.228008)
It’s a tiny bit better, which shows that you shouldn’t always just use your entire dataset; sometimes a subset can be better.
Let’s see if using a neural network helps.
Using a Neural Network
We can use the same approach to build a neural network model. Let's first replicate the steps we took to set up the `TabularPandas` object:
In [95]:
df_nn = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
df_nn['ProductSize'] = df_nn['ProductSize'].cat.set_categories(sizes, ordered=True)
df_nn[dep_var] = np.log(df_nn[dep_var])
df_nn = add_datepart(df_nn, 'saledate')
We can leverage the work we did to trim unwanted columns in the random forest by using the same set of columns for our neural network:
In [96]:
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]
Categorical columns are handled very differently in neural networks, compared to decision tree approaches. As we saw in <>, in a neural net a great way to handle categorical variables is by using embeddings. To create embeddings, fastai needs to determine which columns should be treated as categorical variables. It does this by comparing the number of distinct levels in the variable to the value of the `max_card` parameter. If it's lower, fastai will treat the variable as categorical. Embedding sizes larger than 10,000 should generally only be used after you've tested whether there are better ways to group the variable, so we'll use 9,000 as our `max_card`:
In [97]:
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
In this case, there's one variable that we absolutely do not want to treat as categorical: the `saleElapsed` variable. A categorical variable cannot, by definition, extrapolate outside the range of values that it has seen, but we want to be able to predict auction sale prices in the future. Let's verify that `cont_cat_split` did the correct thing:
In [98]:
cont_nn
Out[98]:
['saleElapsed']
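This is what we expected. For reference, the rule that produced this split is roughly the following (a simplified sketch of `cont_cat_split`'s behavior, not its actual source): float columns are always treated as continuous, while integer columns are treated as continuous only when they have more than `max_card` distinct values.

```python
import pandas as pd

def split_cont_cat(df, max_card, dep_var):
    "Rough approximation of cont_cat_split: floats are continuous; ints only above max_card."
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue
        if (pd.api.types.is_float_dtype(df[col]) or
                (pd.api.types.is_integer_dtype(df[col]) and df[col].nunique() > max_card)):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat
```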
Let’s take a look at the cardinality of each of the categorical variables that we have chosen so far:
In [108]:
df_nn_final[cat_nn].nunique()
Out[108]:
YearMade 73
ProductSize 6
Coupler_System 2
fiProductClassDesc 74
Hydraulics_Flow 3
ModelID 5281
fiSecondaryDesc 177
fiModelDesc 5059
Enclosure 6
Hydraulics 12
ProductGroup 6
Drive_System 4
Tire_Size 17
dtype: int64
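Each of these categorical variables will get its own embedding matrix, with one row per level. fastai chooses the width of each embedding with a simple heuristic, which at the time of writing looks roughly like this (a sketch of the default rule that `get_emb_sz` applies; treat the exact constants as approximate):

```python
def emb_sz_rule(n_cat):
    "Approximate fastai default: embedding width grows slowly with cardinality, capped at 600."
    return min(600, round(1.6 * n_cat**0.56))

# e.g. a variable with 5,059 levels such as fiModelDesc gets an embedding matrix with
# 5,059 rows and roughly emb_sz_rule(5059) columns:
print(emb_sz_rule(5059))
```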
The fact that there are two variables pertaining to the "model" of the equipment, both with similar very high cardinalities, suggests that they may contain similar, redundant information. Note that we would not necessarily catch this when analyzing redundant features, since that analysis relies on similar variables being sorted in the same order (that is, they need to have similarly named levels). Having a column with 5,000 levels means needing 5,000 rows in our embedding matrix, which would be nice to avoid if possible. Let's see what impact removing one of these model columns has on the random forest:
In [109]:
xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
Out[109]:
(0.176713, 0.230195)
There’s minimal impact, so we will remove it as a predictor for our neural network:
In [111]:
cat_nn.remove('fiModelDescriptor')
We can create our `TabularPandas` object in the same way as when we created our random forest, with one very important addition: normalization. A random forest does not need any normalization, since the tree-building procedure cares only about the order of values in a variable, not at all about how they are scaled. But as we have seen, a neural network definitely does care about this. Therefore, we add the `Normalize` processor when we build our `TabularPandas` object:
In [112]:
procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
splits=splits, y_names=dep_var)
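For intuition, the `Normalize` step amounts to standard mean/std scaling of each continuous column, using statistics computed on the training split and then applied unchanged to the validation split. A toy sketch with made-up numbers:

```python
import pandas as pd

# Toy illustration of normalization: standardize with statistics from the *training*
# rows, then apply the same mean and std to the validation rows.
train_col = pd.Series([100.0, 200.0, 300.0])   # hypothetical training values
valid_col = pd.Series([400.0, 500.0])          # hypothetical validation values
mean, std = train_col.mean(), train_col.std()
print((train_col - mean) / std)
print((valid_col - mean) / std)                # validation reuses the training statistics
```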
Tabular models and data don’t generally require much GPU RAM, so we can use larger batch sizes:
In [113]:
dls = to_nn.dataloaders(1024)
As we've discussed, it's a good idea to set `y_range` for regression models, so let's find the min and max of our dependent variable:
In [114]:
y = to_nn.train.y
y.min(),y.max()
Out[114]:
(8.465899467468262, 11.863582611083984)
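These values explain the `y_range=(8,12)` passed below. When `y_range` is set, fastai applies a scaled sigmoid to the model's final activation so that predictions always land inside that range; roughly, it behaves like this sketch of `sigmoid_range`:

```python
import torch

def sigmoid_range(x, low, high):
    "Squash a raw activation into (low, high); roughly what fastai applies when y_range is set."
    return torch.sigmoid(x) * (high - low) + low

print(sigmoid_range(torch.tensor(0.0), 8, 12))   # a raw output of 0 maps to the midpoint, 10
print(sigmoid_range(torch.tensor(10.0), 8, 12))  # large raw outputs saturate just below 12
```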
We can now create the `Learner` for this tabular model. As usual, we use the application-specific learner function, to take advantage of its application-customized defaults. We set the loss function to MSE, since that's what this competition uses.
By default, for tabular data fastai creates a neural network with two hidden layers, with 200 and 100 activations, respectively. This works quite well for small datasets, but here we’ve got quite a large dataset, so we increase the layer sizes to 500 and 250:
In [120]:
learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],
n_out=1, loss_func=F.mse_loss)
In [116]:
learn.lr_find()
Out[116]:
SuggestedLRs(lr_min=0.002754228748381138, lr_steep=0.00015848931798245758)
There's no need to use `fine_tune`, so we'll train with `fit_one_cycle` for a few epochs and see how it looks:
In [121]:
learn.fit_one_cycle(5, 1e-2)
| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 0.068459 | 0.061185 | 00:09 |
| 1 | 0.056469 | 0.058471 | 00:09 |
| 2 | 0.048689 | 0.052404 | 00:09 |
| 3 | 0.044529 | 0.052138 | 00:09 |
| 4 | 0.040860 | 0.051236 | 00:09 |
We can use our `r_mse` function to compare the result to the random forest result we got earlier:
In [122]:
preds,targs = learn.get_preds()
r_mse(preds,targs)
Out[122]:
0.226353
It’s quite a bit better than the random forest (although it took longer to train, and it’s fussier about hyperparameter tuning).
Before we move on, let’s save our model in case we want to come back to it again later:
In [123]:
learn.save('nn')
Out[123]:
Path('models/nn.pth')
Sidebar: fastai’s Tabular Classes
In fastai, a tabular model is simply a model that takes columns of continuous or categorical data, and predicts a category (a classification model) or a continuous value (a regression model). Categorical independent variables are passed through an embedding, and concatenated, as we saw in the neural net we used for collaborative filtering, and then continuous variables are concatenated as well.
The model created in `tabular_learner` is an object of class `TabularModel`. Take a look at the source for `tabular_learner` now (remember, that's `tabular_learner??` in Jupyter). You'll see that like `collab_learner`, it first calls `get_emb_sz` to calculate appropriate embedding sizes (you can override these by using the `emb_szs` parameter, which is a dictionary containing any column names you want to set sizes for manually), and it sets a few other defaults. Other than that, it just creates the `TabularModel` and passes it to `TabularLearner` (note that `TabularLearner` is identical to `Learner`, except for a customized `predict` method).
That means that really all the work is happening in `TabularModel`, so take a look at the source for that now. With the exception of the `BatchNorm1d` and `Dropout` layers (which we'll be learning about shortly), you now have the knowledge required to understand this whole class. Take a look at the discussion of `EmbeddingNN` at the end of the last chapter. Recall that it passed `n_cont=0` to `TabularModel`. We now can see why that was: because there are zero continuous variables (in fastai the `n_` prefix means "number of," and `cont` is an abbreviation for "continuous").
End sidebar
Another thing that can help with generalization is to use several models and average their predictions—a technique, as mentioned earlier, known as ensembling.