Extrapolation and Neural Networks
A problem with random forests, like all machine learning or deep learning algorithms, is that they don’t always generalize well to new data. We will see in which situations neural networks generalize better, but first, let’s look at the extrapolation problem that random forests have.
The Extrapolation Problem
In [83]:
#hide
np.random.seed(42)
torch.manual_seed(42)  # also seed torch, since the data below is generated with torch.randn_like
Let’s consider the simple task of making predictions from 40 data points showing a slightly noisy linear relationship:
In [84]:
x_lin = torch.linspace(0,20, steps=40)
y_lin = x_lin + torch.randn_like(x_lin)
plt.scatter(x_lin, y_lin);
Although we only have a single independent variable, sklearn expects a matrix of independent variables, not a single vector. So we have to turn our vector into a matrix with one column. In other words, we have to change the shape from `[40]` to `[40,1]`. One way to do that is with the `unsqueeze` method, which adds a new unit axis to a tensor at the requested dimension:
In [85]:
xs_lin = x_lin.unsqueeze(1)
x_lin.shape,xs_lin.shape
Out[85]:
(torch.Size([40]), torch.Size([40, 1]))
A more flexible approach is to slice an array or tensor with the special value `None`, which introduces an additional unit axis at that location:
In [86]:
x_lin[:,None].shape
Out[86]:
torch.Size([40, 1])
We can now create a random forest for this data. We’ll use only the first 30 rows to train the model:
In [87]:
m_lin = RandomForestRegressor().fit(xs_lin[:30],y_lin[:30])
Then we’ll test the model on the full dataset. The blue dots are the training data, and the red dots are the predictions:
In [88]:
plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);
We have a big problem! Our predictions outside of the domain that our training data covered are all too low. Why do you suppose this is?
Remember, a random forest just averages the predictions of a number of trees. And a tree simply predicts the average value of the rows in a leaf. Therefore, a tree and a random forest can never predict values outside of the range of the training data. This is particularly problematic for data where there is a trend over time, such as inflation, and you wish to make predictions for a future time. Your predictions will be systematically too low.
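To see this concretely, here is a quick sanity check (a small sketch reusing the `m_lin` forest and the tensors created above): no matter what inputs we give it, the forest's prediction can never exceed the largest target it saw during training, because every leaf stores an average of training targets, and the forest averages those leaf values.

```python
# Sketch: a random forest regressor's predictions are capped by its training targets.
train_max = y_lin[:30].max().item()   # largest target among the 30 training rows
preds = m_lin.predict(xs_lin)         # predict across the full range of x values
print(preds.max(), train_max)         # the forest's ceiling is the training maximum
assert preds.max() <= train_max + 1e-6
```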
But the problem extends beyond time variables. Random forests are not able to extrapolate outside of the types of data they have seen, in a more general sense. That’s why we need to make sure our validation set does not contain out-of-domain data.
Finding Out-of-Domain Data
Sometimes it is hard to know whether your test set is distributed in the same way as your training data, or, if it is different, what columns reflect that difference. There’s actually an easy way to figure this out, which is to use a random forest!
But in this case we don’t use the random forest to predict our actual dependent variable. Instead, we try to predict whether a row is in the validation set or the training set. To see this in action, let’s combine our training and validation sets together, create a dependent variable that represents which dataset each row comes from, build a random forest using that data, and get its feature importance:
In [89]:
df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))
m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]
Out[89]:
| | cols | imp |
|---|---|---|
| 6 | saleElapsed | 0.891571 |
| 9 | SalesID | 0.091174 |
| 14 | MachineID | 0.012950 |
| 0 | YearMade | 0.001520 |
| 10 | Enclosure | 0.000430 |
| 5 | ModelID | 0.000395 |
This shows that there are three columns that differ significantly between the training and validation sets: `saleElapsed`, `SalesID`, and `MachineID`. It's fairly obvious why this is the case for `saleElapsed`: it's the number of days between the start of the dataset and each row, so it directly encodes the date. The difference in `SalesID` suggests that identifiers for auction sales might increment over time. And `MachineID` suggests something similar might be happening for individual items sold in those auctions.
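A quick way to confirm this reading is to compare the two sets directly (a small check reusing the frames from above): if these columns really encode time, the validation rows, which come from later sales, should have noticeably larger values on average.

```python
# Compare training vs. validation averages for the three suspect columns;
# identifiers that grow over time show up as a clear gap between the two.
for c in ('saleElapsed', 'SalesID', 'MachineID'):
    print(c, xs_final[c].mean(), valid_xs_final[c].mean())
```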
Let’s get a baseline of the original random forest model’s RMSE, then see what the effect is of removing each of these columns in turn:
In [90]:
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))
for c in ('SalesID','saleElapsed','MachineID'):
m = rf(xs_final.drop(c,axis=1), y)
print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))
orig 0.232883
SalesID 0.230347
saleElapsed 0.235529
MachineID 0.230735
It looks like we should be able to remove `SalesID` and `MachineID` without losing any accuracy. Let's check:
In [91]:
time_vars = ['SalesID','MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)
m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)
Out[91]:
0.229498
Removing these variables has slightly improved the model's accuracy; but more importantly, it should make it more resilient over time, and easier to maintain and understand. We recommend that for all datasets you try building a model where your dependent variable is `is_valid`, like we did here. It can often uncover subtle domain shift issues that you may otherwise miss.
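If you want to reuse this trick on other datasets, it's easy to package as a small helper. Here is a sketch using plain sklearn (`domain_shift_report` is a hypothetical name, and the hyperparameters are just reasonable defaults):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def domain_shift_report(train_xs, valid_xs, n_top=6):
    "Rank columns by how well a random forest can use them to tell training rows from validation rows."
    df = pd.concat([train_xs, valid_xs])
    is_valid = np.array([0]*len(train_xs) + [1]*len(valid_xs))
    m = RandomForestClassifier(n_estimators=40, min_samples_leaf=15, n_jobs=-1)
    m.fit(df, is_valid)
    fi = pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_})
    return fi.sort_values('imp', ascending=False).head(n_top)

# e.g. domain_shift_report(xs_final, valid_xs_final)
```

Columns near the top of that report are the ones most likely to cause out-of-domain surprises, so they are the first candidates to drop or transform.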
One thing that might help in our case is to simply avoid using old data. Often, old data shows relationships that just aren’t valid any more. Let’s try just using the most recent few years of the data:
In [92]:
xs['saleYear'].hist();
Here’s the result of training on this subset:
In [93]:
filt = xs['saleYear']>2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]
In [94]:
m = rf(xs_filt, y_filt)
m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)
Out[94]:
(0.177284, 0.228008)
It’s a tiny bit better, which shows that you shouldn’t always just use your entire dataset; sometimes a subset can be better.
Let’s see if using a neural network helps.
Using a Neural Network
We can use the same approach to build a neural network model. Let's first replicate the steps we took to set up the `TabularPandas` object:
In [95]:
df_nn = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
df_nn['ProductSize'] = df_nn['ProductSize'].cat.set_categories(sizes, ordered=True)
df_nn[dep_var] = np.log(df_nn[dep_var])
df_nn = add_datepart(df_nn, 'saledate')
We can leverage the work we did to trim unwanted columns in the random forest by using the same set of columns for our neural network:
In [96]:
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]
Categorical columns are handled very differently in neural networks, compared to decision tree approaches. As we saw in <>, in a neural net a great way to handle categorical variables is by using embeddings. To create embeddings, fastai needs to determine which columns should be treated as categorical variables. It does this by comparing the number of distinct levels in the variable to the value of the `max_card` parameter. If it's lower, fastai will treat the variable as categorical. Embedding sizes larger than 10,000 should generally only be used after you've tested whether there are better ways to group the variable, so we'll use 9,000 as our `max_card`:
In [97]:
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
In this case, there's one variable that we absolutely do not want to treat as categorical: the `saleElapsed` variable. A categorical variable cannot, by definition, extrapolate outside the range of values that it has seen, but we want to be able to predict auction sale prices in the future. Let's verify that `cont_cat_split` did the correct thing:
In [98]:
cont_nn
Out[98]:
['saleElapsed']
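This is what we expected. For reference, the rule that produced this split is roughly the following (a simplified sketch of `cont_cat_split`'s behavior, not its actual source): float columns are always treated as continuous, while integer columns are treated as continuous only when they have more than `max_card` distinct values.

```python
import pandas as pd

def split_cont_cat(df, max_card, dep_var):
    "Rough approximation of cont_cat_split: floats are continuous; ints only above max_card."
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue
        if (pd.api.types.is_float_dtype(df[col]) or
                (pd.api.types.is_integer_dtype(df[col]) and df[col].nunique() > max_card)):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat
```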
Let’s take a look at the cardinality of each of the categorical variables that we have chosen so far:
In [108]:
df_nn_final[cat_nn].nunique()
Out[108]:
YearMade 73
ProductSize 6
Coupler_System 2
fiProductClassDesc 74
Hydraulics_Flow 3
ModelID 5281
fiSecondaryDesc 177
fiModelDesc 5059
Enclosure 6
Hydraulics 12
ProductGroup 6
Drive_System 4
Tire_Size 17
dtype: int64
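Each of these categorical variables will get its own embedding matrix, with one row per level. fastai chooses the width of each embedding with a simple heuristic, which at the time of writing looks roughly like this (a sketch of the default rule that `get_emb_sz` applies; treat the exact constants as approximate):

```python
def emb_sz_rule(n_cat):
    "Approximate fastai default: embedding width grows slowly with cardinality, capped at 600."
    return min(600, round(1.6 * n_cat**0.56))

# e.g. a variable with 5,059 levels such as fiModelDesc gets an embedding matrix with
# 5,059 rows and roughly emb_sz_rule(5059) columns:
print(emb_sz_rule(5059))
```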
The fact that there are two variables pertaining to the "model" of the equipment, both with similar very high cardinalities, suggests that they may contain similar, redundant information. Note that we would not necessarily catch this when analyzing redundant features, since that analysis relies on similar variables being sorted in the same order (that is, they need to have similarly named levels). Having a column with 5,000 levels means needing 5,000 rows in our embedding matrix, which would be nice to avoid if possible. Let's see what impact removing one of these model columns has on the random forest:
In [109]:
xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
Out[109]:
(0.176713, 0.230195)
There’s minimal impact, so we will remove it as a predictor for our neural network:
In [111]:
cat_nn.remove('fiModelDescriptor')
We can create our `TabularPandas` object in the same way as when we created our random forest, with one very important addition: normalization. A random forest does not need any normalization, since the tree-building procedure cares only about the order of values in a variable, not at all about how they are scaled. But as we have seen, a neural network definitely does care about this. Therefore, we add the `Normalize` processor when we build our `TabularPandas` object:
In [112]:
procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
splits=splits, y_names=dep_var)
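For intuition, the `Normalize` step amounts to standard mean/std scaling of each continuous column, using statistics computed on the training split and then applied unchanged to the validation split. A toy sketch with made-up numbers:

```python
import pandas as pd

# Toy illustration of normalization: standardize with statistics from the *training*
# rows, then apply the same mean and std to the validation rows.
train_col = pd.Series([100.0, 200.0, 300.0])   # hypothetical training values
valid_col = pd.Series([400.0, 500.0])          # hypothetical validation values
mean, std = train_col.mean(), train_col.std()
print((train_col - mean) / std)
print((valid_col - mean) / std)                # validation reuses the training statistics
```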
Tabular models and data don’t generally require much GPU RAM, so we can use larger batch sizes:
In [113]:
dls = to_nn.dataloaders(1024)
As we've discussed, it's a good idea to set `y_range` for regression models, so let's find the min and max of our dependent variable:
In [114]:
y = to_nn.train.y
y.min(),y.max()
Out[114]:
(8.465899467468262, 11.863582611083984)
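These values explain the `y_range=(8,12)` passed below. When `y_range` is set, fastai applies a scaled sigmoid to the model's final activation so that predictions always land inside that range; roughly, it behaves like this sketch of `sigmoid_range`:

```python
import torch

def sigmoid_range(x, low, high):
    "Squash a raw activation into (low, high); roughly what fastai applies when y_range is set."
    return torch.sigmoid(x) * (high - low) + low

print(sigmoid_range(torch.tensor(0.0), 8, 12))   # a raw output of 0 maps to the midpoint, 10
print(sigmoid_range(torch.tensor(10.0), 8, 12))  # large raw outputs saturate just below 12
```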
We can now create the `Learner` for this tabular model. As usual, we use the application-specific learner function, to take advantage of its application-customized defaults. We set the loss function to MSE, since that's what this competition uses.
By default, for tabular data fastai creates a neural network with two hidden layers, with 200 and 100 activations, respectively. This works quite well for small datasets, but here we’ve got quite a large dataset, so we increase the layer sizes to 500 and 250:
In [120]:
learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],
n_out=1, loss_func=F.mse_loss)
In [116]:
learn.lr_find()
Out[116]:
SuggestedLRs(lr_min=0.002754228748381138, lr_steep=0.00015848931798245758)
There's no need to use `fine_tune`, so we'll train with `fit_one_cycle` for a few epochs and see how it looks:
In [121]:
learn.fit_one_cycle(5, 1e-2)
| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 0.068459 | 0.061185 | 00:09 |
| 1 | 0.056469 | 0.058471 | 00:09 |
| 2 | 0.048689 | 0.052404 | 00:09 |
| 3 | 0.044529 | 0.052138 | 00:09 |
| 4 | 0.040860 | 0.051236 | 00:09 |
We can use our `r_mse` function to compare the result to the random forest result we got earlier:
In [122]:
preds,targs = learn.get_preds()
r_mse(preds,targs)
Out[122]:
0.226353
It’s quite a bit better than the random forest (although it took longer to train, and it’s fussier about hyperparameter tuning).
Before we move on, let’s save our model in case we want to come back to it again later:
In [123]:
learn.save('nn')
Out[123]:
Path('models/nn.pth')
Sidebar: fastai’s Tabular Classes
In fastai, a tabular model is simply a model that takes columns of continuous or categorical data, and predicts a category (a classification model) or a continuous value (a regression model). Categorical independent variables are passed through an embedding, and concatenated, as we saw in the neural net we used for collaborative filtering, and then continuous variables are concatenated as well.
The model created in `tabular_learner` is an object of class `TabularModel`. Take a look at the source for `tabular_learner` now (remember, that's `tabular_learner??` in Jupyter). You'll see that like `collab_learner`, it first calls `get_emb_sz` to calculate appropriate embedding sizes (you can override these by using the `emb_szs` parameter, which is a dictionary containing any column names you want to set sizes for manually), and it sets a few other defaults. Other than that, it just creates the `TabularModel` and passes it to `TabularLearner` (note that `TabularLearner` is identical to `Learner`, except for a customized `predict` method).
That means that really all the work is happening in `TabularModel`, so take a look at the source for that now. With the exception of the `BatchNorm1d` and `Dropout` layers (which we'll be learning about shortly), you now have the knowledge required to understand this whole class. Take a look at the discussion of `EmbeddingNN` at the end of the last chapter. Recall that it passed `n_cont=0` to `TabularModel`. We now can see why that was: because there are zero continuous variables (in fastai the `n_` prefix means "number of," and `cont` is an abbreviation for "continuous").
End sidebar
Another thing that can help with generalization is to use several models and average their predictions—a technique, as mentioned earlier, known as ensembling.