If you hit a “CUDA out of memory” error after running this cell, open the Kernel menu and choose Restart. Then, instead of executing the cell above, copy and paste the following code into it:
from fastai.text.all import *
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test', bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)
This reduces the batch size to 32 (we will explain this later). If you keep hitting the same error, change 32 to 16.
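To get a feel for what changing the batch size does, here is a minimal sketch in plain Python. It assumes a training set of 25,000 items and a starting batch size of 64; both numbers are illustrative assumptions rather than values read from the library, and the sketch ignores details such as dropping a final partial batch. Each halving of the batch size halves the number of items held in memory at once and roughly doubles the number of steps per epoch.

```python
import math

def batches_per_epoch(n_items: int, bs: int) -> int:
    """Number of mini-batches needed to see every item once."""
    return math.ceil(n_items / bs)

n_items = 25_000  # assumed training-set size, for illustration only

for bs in (64, 32, 16):
    print(f"bs={bs:>2}: {batches_per_epoch(n_items, bs)} batches per epoch")
```

Smaller batches trade memory for time: each step is lighter, but there are more of them.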
This model is using the “IMDb Large Movie Review dataset” from the paper “Learning Word Vectors for Sentiment Analysis” by Andrew Maas et al. It works well with movie reviews of many thousands of words, but let’s test it out on a very short one to see how it does its thing:
In [ ]:
learn.predict("I really liked that movie!")
Out[ ]:
('pos', tensor(1), tensor([0.0041, 0.9959]))
Here we can see the model has considered the review to be positive. The second part of the result is the index of “pos” in our data vocabulary and the last part is the probabilities attributed to each class (99.6% for “pos” and 0.4% for “neg”).
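The three parts of that result fit together in a simple way, which the following sketch reproduces with plain Python lists instead of tensors. The helper decode_prediction is hypothetical (it is not a fastai function), and the class order ["neg", "pos"] is an assumption made for illustration; the probabilities are the rounded ones quoted in the text.

```python
def decode_prediction(probs, vocab):
    """Return (label, index, probs) for the most probable class."""
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[idx], idx, probs

vocab = ["neg", "pos"]   # assumed class order, for illustration
probs = [0.004, 0.996]   # 0.4% for "neg", 99.6% for "pos"

label, idx, _ = decode_prediction(probs, vocab)
print(label, idx)
```

The predicted label is just the vocabulary entry at the position of the largest probability.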
Now it’s your turn! Write your own mini movie review, or copy one from the internet, and you can see what this model thinks about it.
Sidebar: The Order Matters
In a Jupyter notebook, the order in which you execute each cell is very important. It’s not like Excel, where everything gets updated as soon as you type something anywhere; a notebook has an inner state that gets updated each time you execute a cell. For instance, when you run the first cell of the notebook (with the “CLICK ME” comment), you create an object called learn that contains a model and data for an image classification problem. If we were to run the cell just shown in the text (the one that predicts if a review is good or not) straight after, we would get an error, as this learn object does not contain a text classification model. This cell needs to be run after the one containing:
from fastai.text.all import *
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy)
learn.fine_tune(4, 1e-2)
The outputs themselves can be deceiving, because they include the results of the last time the cell was executed; if you change the code inside a cell without executing it, the old (misleading) results will remain.
Except when we mention it explicitly, the notebooks provided on the book website are meant to be run in order, from top to bottom. In general, when experimenting, you will find yourself executing cells in any order to go fast (which is a super neat feature of Jupyter Notebook), but once you have explored and arrived at the final version of your code, make sure you can run the cells of your notebooks in order (your future self won’t necessarily remember the convoluted path you took otherwise!).
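The hidden state described in this sidebar can be imitated in a few lines. In this toy sketch, each “cell” is a string of code and a dictionary named state stands in for the kernel’s memory; all names here are made up for illustration and have nothing to do with Jupyter’s internals. Note that in this toy version the out-of-order run silently uses the wrong object rather than raising an error.

```python
# `state` plays the role of the kernel's memory, shared by every cell.
state = {}

cell_image = "learn = 'image classifier'"
cell_text = "learn = 'text classifier'"
cell_predict = "result = f'predicting with the {learn}'"

def run(cell):
    """Execute one cell against the shared state, like pressing Shift-Enter."""
    exec(cell, state)

# Running the prediction cell right after the image cell uses the wrong `learn`.
run(cell_image)
run(cell_predict)
stale = state["result"]

# Running the text cell first rebinds `learn`, so the same cell now behaves differently.
run(cell_text)
run(cell_predict)
fresh = state["result"]

print(stale)
print(fresh)
```

The prediction cell never changed; only the state it ran against did.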
In command mode, pressing 0 twice will restart the kernel (which is the engine powering your notebook). This will wipe your state clean and make it as if you had just started in the notebook. Choose Run All Above from the Cell menu to run all cells above the point where you are. We have found this to be very useful when developing the fastai library.
End sidebar
If you ever have any questions about a fastai method, you should use the function doc, passing it the method name:
doc(learn.predict)
This will make a small window pop up with content like this:
A brief one-line explanation is provided by doc. The “Show in docs” link takes you to the full documentation, where you’ll find all the details and lots of examples. Also, most of fastai’s methods are just a handful of lines, so you can click the “source” link to see exactly what’s going on behind the scenes.
Let’s move on to something much less sexy, but perhaps significantly more widely commercially useful: building models from plain tabular data.
jargon: Tabular: Data that is in the form of a table, such as from a spreadsheet, database, or CSV file. A tabular model is a model that tries to predict one column of a table based on information in other columns of the table.
It turns out that looks very similar too. Here is the code necessary to train a model that will predict whether a person is a high-income earner, based on their socioeconomic background:
In [ ]:
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation',
                 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
learn = tabular_learner(dls, metrics=accuracy)
As you see, we had to tell fastai which columns are categorical (that is, contain values that are one of a discrete set of choices, such as occupation) and which are continuous (that is, contain a number that represents a quantity, such as age).
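The distinction matters because the two kinds of column are prepared differently before the model sees them. The sketch below illustrates the general idea behind the three procs named in the code (Categorify, FillMissing, Normalize) using only the Python standard library. It is a simplified illustration under our own assumptions, not fastai’s implementation: the function names, the choice of the median as the fill value, and the use of code 0 for missing categories are all assumptions here, and the sample values are made up.

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each distinct category to an integer code (0 is used for missing)."""
    vocab = sorted({v for v in values if v is not None})
    codes = {v: i + 1 for i, v in enumerate(vocab)}
    return [codes.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace missing numbers with the median of the observed ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def normalize(values):
    """Shift and scale a column to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up sample columns: one categorical, one continuous, each with a gap.
occupation = ["Sales", "Tech-support", None, "Sales"]
age = [39, None, 50, 28]

codes, vocab = categorify(occupation)
ages = normalize(fill_missing(age))

print(codes, vocab)
print([round(a, 3) for a in ages])
```

Categorical columns end up as integer codes, while continuous columns end up as numbers on a comparable scale.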
There is no pretrained model available for this task (in general, pretrained models are not widely available for any tabular modeling tasks, although some organizations have created them for internal use), so we don’t use fine_tune in this case. Instead we use fit_one_cycle, the most commonly used method for training fastai models from scratch (i.e., without transfer learning):
In [ ]:
learn.fit_one_cycle(3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.372397 | 0.357177 | 0.832463 | 00:08 |
1 | 0.351544 | 0.341505 | 0.841523 | 00:08 |
2 | 0.338763 | 0.339184 | 0.845670 | 00:08 |
This model is using the Adult dataset, from the paper “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid” by Ron Kohavi, which contains some demographic data about individuals (like their education, marital status, race, sex, and whether or not they have an annual income greater than $50k). The model is over 80% accurate, and took around 30 seconds to train.
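Accuracy here simply means the fraction of rows for which the predicted label matches the actual one. As a minimal sketch, with a handful of made-up labels (the label strings below are placeholders, not values taken from the dataset):

```python
def accuracy(preds, targets):
    """Fraction of predictions that exactly match their targets."""
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

# Hypothetical predictions and true labels for five people.
targets = ["high", "low", "low", "high", "low"]
preds   = ["high", "low", "high", "high", "low"]

print(accuracy(preds, targets))
```

Four of the five made-up predictions match, so this toy example scores 0.8.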
Let’s look at one more. Recommendation systems are very important, particularly in e-commerce. Companies like Amazon and Netflix try hard to recommend products or movies that users might like. Here’s how to train a model that will predict movies people might like, based on their previous viewing habits, using the MovieLens dataset:
In [ ]:
from fastai.collab import *
path = untar_data(URLs.ML_SAMPLE)
dls = CollabDataLoaders.from_csv(path/'ratings.csv')
learn = collab_learner(dls, y_range=(0.5,5.5))
learn.fine_tune(10)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.510897 | 1.410028 | 00:00 |
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.375435 | 1.350930 | 00:00 |
1 | 1.270062 | 1.173962 | 00:00 |
2 | 1.023159 | 0.879298 | 00:00 |
3 | 0.797398 | 0.739787 | 00:00 |
4 | 0.685500 | 0.700903 | 00:00 |
5 | 0.646508 | 0.686387 | 00:00 |
6 | 0.623985 | 0.681087 | 00:00 |
7 | 0.606319 | 0.676885 | 00:00 |
8 | 0.606975 | 0.675833 | 00:00 |
9 | 0.602670 | 0.675682 | 00:00 |
This model is predicting movie ratings on a scale of 0.5 to 5.0 to within around 0.6 average error. Since we’re predicting a continuous number, rather than a category, we have to tell fastai what range our target has, using the y_range parameter.
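A common way to constrain predictions to a range is to pass the model’s raw output through a sigmoid that has been stretched to cover that range. The sketch below assumes that approach; the function is our own plain-Python illustration and should not be read as the library’s code. It also suggests one plausible reason for choosing an upper bound of 5.5 rather than 5.0: a sigmoid gets arbitrarily close to its bounds but never reaches them, so a little headroom is needed for the model to be able to predict a full 5.0.

```python
import math

def sigmoid_range(x, low, high):
    """Squash any real number into the open interval (low, high)."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

# Very negative, neutral, and very positive raw outputs.
for raw in (-10.0, 0.0, 10.0):
    print(raw, sigmoid_range(raw, 0.5, 5.5))
```

However extreme the raw output, the result always stays strictly between the two bounds.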
Although we’re not actually using a pretrained model (for the same reason that we didn’t for the tabular model), this example shows that fastai lets us use fine_tune anyway in this case (you’ll learn how and why this works later in the book). Sometimes it’s best to experiment with fine_tune versus fit_one_cycle to see which works best for your dataset.
We can use the same show_results call we saw earlier to view a few examples of user and movie IDs, actual ratings, and predictions:
In [ ]:
learn.show_results()
| | userId | movieId | rating | rating_pred |
|---|---|---|---|---|
| 0 | 66.0 | 79.0 | 4.0 | 3.978900 |
| 1 | 97.0 | 15.0 | 4.0 | 3.851795 |
| 2 | 55.0 | 79.0 | 3.5 | 3.945623 |
| 3 | 98.0 | 91.0 | 4.0 | 4.458704 |
| 4 | 53.0 | 7.0 | 5.0 | 4.670005 |
| 5 | 26.0 | 69.0 | 5.0 | 4.319870 |
| 6 | 81.0 | 16.0 | 4.5 | 4.426761 |
| 7 | 80.0 | 7.0 | 4.0 | 4.046183 |
| 8 | 51.0 | 94.0 | 5.0 | 3.499996 |
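To turn a table like this into a single error figure, we can compare each prediction with its actual rating. The sketch below copies the nine rating pairs from the table and computes two common summaries, the mean absolute error and the root mean squared error. Nine rows are far too few to be representative of the whole validation set, so treat the result as an illustration of the arithmetic only.

```python
import math

# (actual rating, predicted rating) pairs copied from the table above
rows = [
    (4.0, 3.978900), (4.0, 3.851795), (3.5, 3.945623),
    (4.0, 4.458704), (5.0, 4.670005), (5.0, 4.319870),
    (4.5, 4.426761), (4.0, 4.046183), (5.0, 3.499996),
]

errors = [pred - actual for actual, pred in rows]

# Mean absolute error: the average size of the miss, ignoring direction.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalizes large misses more heavily.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}")
```

The last row, where the model predicted about 3.5 for a 5.0 rating, contributes much more to the squared error than to the absolute error.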
Sidebar: Datasets: Food for Models
You’ve already seen quite a few models in this section, each one trained using a different dataset to do a different task. In machine learning and deep learning, we can’t do anything without data. So, the people that create datasets for us to train our models on are the (often underappreciated) heroes. Some of the most useful and important datasets are those that become important academic baselines; that is, datasets that are widely studied by researchers and used to compare algorithmic changes. Some of these become household names (at least, among households that train models!), such as MNIST, CIFAR-10, and ImageNet.
The datasets used in this book have been selected because they provide great examples of the kinds of data that you are likely to encounter, and the academic literature has many examples of model results using these datasets to which you can compare your work.
Most datasets used in this book took the creators a lot of work to build. For instance, later in the book we’ll be showing you how to create a model that can translate between French and English. The key input to this is a French/English parallel text corpus prepared back in 2009 by Professor Chris Callison-Burch of the University of Pennsylvania. This dataset contains over 20 million sentence pairs in French and English. He built the dataset in a really clever way: by crawling millions of Canadian web pages (which are often multilingual) and then using a set of simple heuristics to map URLs of French content onto URLs pointing to the same content in English.
As you look at datasets throughout this book, think about where they might have come from, and how they might have been curated. Then think about what kinds of interesting datasets you could create for your own projects. (We’ll even take you step by step through the process of creating your own image dataset soon.)
fast.ai has spent a lot of time creating cut-down versions of popular datasets that are specially designed to support rapid prototyping and experimentation, and to be easier to learn with. In this book we will often start by using one of the cut-down versions and later scale up to the full-size version (just as we’re doing in this chapter!). In fact, this is how the world’s top practitioners do their modeling in practice; they do most of their experimentation and prototyping with subsets of their data, and only use the full dataset when they have a good understanding of what they have to do.
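You can apply the same idea to your own data by prototyping on a random subset. Here is a minimal sketch using only the standard library; the helper sample_subset, the 5% fraction, and the fixed seed are all illustrative assumptions, and the list of integers stands in for a real dataset.

```python
import random

def sample_subset(items, fraction, seed=42):
    """Return a reproducible random subset containing `fraction` of `items`."""
    rng = random.Random(seed)  # fixed seed so every run picks the same subset
    k = max(1, round(len(items) * fraction))
    return rng.sample(items, k)

full = list(range(10_000))   # stand-in for a full dataset
small = sample_subset(full, 0.05)

print(len(full), len(small))
```

Fixing the seed keeps the subset stable between runs, so changes in results come from your code rather than from a different sample.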
End sidebar
Each of the models we trained showed a training and validation loss. A good validation set is one of the most important pieces of the training process. Let’s see why and learn how to create one.