Training a Text Classifier
As we saw at the beginning of this chapter, there are two steps to training a state-of-the-art text classifier using transfer learning: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews, and then we can use that model to train a classifier.
As usual, let’s start with assembling our data.
Language Model Using DataBlock
fastai handles tokenization and numericalization automatically when a `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenizer` and `Numericalize` can also be passed to `TextBlock`. In the next chapter we’ll discuss the easiest ways to run each of these steps separately, to ease debugging, but you can always debug by running them manually on a subset of your data, as shown in the previous sections. And don’t forget about `DataBlock`’s handy `summary` method, which is very useful for debugging data issues.
Here’s how we use `TextBlock` to create a language model, using fastai’s defaults:
In [ ]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
blocks=TextBlock.from_folder(path, is_lm=True),
get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
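If something in this pipeline misbehaves, one option (a sketch reusing the same definitions as above) is to build the `DataBlock` on its own and call the `summary` method mentioned earlier, before creating the `DataLoaders`:
In [ ]:
# Sketch: build the DataBlock first, then inspect every processing step
# applied to one sample before committing to full DataLoaders creation
lm_block = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1))
lm_block.summary(path)   # prints the result of each transform on one item
dls_lm = lm_block.dataloaders(path, path=path, bs=128, seq_len=80)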
One thing that’s different from previous types we’ve used in `DataBlock` is that we’re not just using the class directly (i.e., `TextBlock(...)`), but instead are calling a class method. A class method is a Python method that, as the name suggests, belongs to a class rather than an object. (Be sure to search online for more information about class methods if you’re not familiar with them, since they’re commonly used in many Python libraries and applications; we’ve used them a few times previously in the book, but haven’t called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer’s vocab can take a long time (we have to read and tokenize every document to get the vocab). To be as efficient as possible it performs a few optimizations:
- It saves the tokenized documents in a temporary folder, so it doesn’t have to tokenize them more than once
- It runs multiple tokenization processes in parallel, to take advantage of your computer’s CPUs
We need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing; that’s what `from_folder` does.
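If class methods are new to you, here is a minimal, generic sketch (plain Python, not fastai code) of one being used as an alternative constructor, which is exactly the role `from_folder` plays here:
In [ ]:
from pathlib import Path

class Dataset:
    "Toy class to illustrate a class method used as an alternative constructor"
    def __init__(self, items):
        self.items = items

    @classmethod
    def from_folder(cls, folder):
        # Called on the class itself, not on an instance; `cls` is the class,
        # so this builds and returns a new Dataset
        return cls(sorted(Path(folder).glob('*.txt')))

ds = Dataset.from_folder('.')   # same calling pattern as TextBlock.from_folder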
`show_batch` then works in the usual way:
In [ ]:
dls_lm.show_batch(max_n=2)
| | text | text_ |
|---|---|---|
0 | xxbos xxmaj it ‘s awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard | xxmaj it ‘s awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard xxunk |
1 | what xxmaj i ‘ve read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \n\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this | xxmaj i ‘ve read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \n\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this is |
Now that our data is ready, we can fine-tune the pretrained language model.
Fine-Tuning the Language Model
To convert the integer word indices into activations that we can use for our neural network, we will use embeddings, just like we did for collaborative filtering and tabular modeling. Then we’ll feed those embeddings into a recurrent neural network (RNN), using an architecture called AWD-LSTM (we will show you how to write such a model from scratch in <>). As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren’t in the pretraining vocabulary. This is handled automatically inside `language_model_learner`:
In [ ]:
learn = language_model_learner(
dls_lm, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]).to_fp16()
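To make that merging step concrete, here is a simplified sketch of the idea (illustrative only, not fastai’s internal implementation): copy the pretrained embedding row for every token the two vocabs share, and give the remaining tokens freshly initialized rows.
In [ ]:
import torch

def merge_embeddings(old_wgts, old_vocab, new_vocab):
    "Sketch: reuse pretrained embedding rows where tokens match, random rows otherwise"
    emb_size = old_wgts.size(1)
    # Start from small random embeddings for every token in the new vocab
    new_wgts = torch.randn(len(new_vocab), emb_size) * 0.1
    old_idx = {w: i for i, w in enumerate(old_vocab)}
    for i, w in enumerate(new_vocab):
        if w in old_idx:
            new_wgts[i] = old_wgts[old_idx[w]]   # copy the pretrained row
    return new_wgts

# Toy example with made-up vocabs and weights
old_vocab = ['the', 'movie', 'good']
new_vocab = ['the', 'movie', 'xxnewword']
merged = merge_embeddings(torch.randn(3, 4), old_vocab, new_vocab)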
The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). The perplexity metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`). We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we’ve seen) is both hard to interpret and tells us more about the model’s confidence than its accuracy.
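As a quick illustration of that relationship (with made-up logits and targets, purely to show the arithmetic):
In [ ]:
import torch
import torch.nn.functional as F

# Made-up next-word predictions over a tiny 5-token vocab, for 4 positions
logits  = torch.randn(4, 5)
targets = torch.tensor([0, 2, 1, 4])

cross_entropy = F.cross_entropy(logits, targets)
perplexity    = torch.exp(cross_entropy)   # the quantity Perplexity() reports
print(cross_entropy.item(), perplexity.item())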
Let’s go back to the process diagram from the beginning of this chapter. The first arrow has been completed for us and made available as a pretrained model in fastai, and we’ve just built the `DataLoaders` and `Learner` for the second stage. Now we’re ready to fine-tune our language model!
It takes quite a while to train each epoch, so we’ll be saving the intermediate model results during the training process. Since `fine_tune` doesn’t do that for us, we’ll use `fit_one_cycle`. Just like `cnn_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (the only part of the model that contains randomly initialized weights, that is, embeddings for words that are in our IMDb vocab but aren’t in the pretrained model vocab):
In [ ]:
learn.fit_one_cycle(1, 2e-2)
epoch | train_loss | valid_loss | accuracy | perplexity | time |
---|---|---|---|---|---|
0 | 4.120048 | 3.912788 | 0.299565 | 50.038246 | 11:39 |
This model takes a while to train, so it’s a good opportunity to talk about saving intermediate results.
Saving and Loading Models
You can easily save the state of your model like so:
In [ ]:
learn.save('1epoch')
This will create a file in `learn.path/models/` named 1epoch.pth. If you want to load your model on another machine after creating your `Learner` the same way, or resume training later, you can load the content of this file with:
In [ ]:
learn = learn.load('1epoch')
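If you want to confirm where that file ended up, a quick check (these are just the defaults; the Learner’s model_dir can be changed):
In [ ]:
# Default location used by learn.save: <learn.path>/<learn.model_dir>/1epoch.pth
print(learn.path/learn.model_dir/'1epoch.pth')
(learn.path/learn.model_dir/'1epoch.pth').exists()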
Once the initial training has completed, we can continue fine-tuning the model after unfreezing:
In [ ]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
epoch | train_loss | valid_loss | accuracy | perplexity | time |
---|---|---|---|---|---|
0 | 3.893486 | 3.772820 | 0.317104 | 43.502548 | 12:37 |
1 | 3.820479 | 3.717197 | 0.323790 | 41.148880 | 12:30 |
2 | 3.735622 | 3.659760 | 0.330321 | 38.851997 | 12:09 |
3 | 3.677086 | 3.624794 | 0.333960 | 37.516987 | 12:12 |
4 | 3.636646 | 3.601300 | 0.337017 | 36.645859 | 12:05 |
5 | 3.553636 | 3.584241 | 0.339355 | 36.026001 | 12:04 |
6 | 3.507634 | 3.571892 | 0.341353 | 35.583862 | 12:08 |
7 | 3.444101 | 3.565988 | 0.342194 | 35.374371 | 12:08 |
8 | 3.398597 | 3.566283 | 0.342647 | 35.384815 | 12:11 |
9 | 3.375563 | 3.568166 | 0.342528 | 35.451500 | 12:05 |
Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not including the final layer is called the encoder. We can save it with `save_encoder`:
In [ ]:
learn.save_encoder('finetuned')
jargon: Encoder: The model not including the task-specific final layer(s). This term means much the same thing as body when applied to vision CNNs, but “encoder” tends to be used more for NLP and generative models.
This completes the second stage of the text classification process: fine-tuning the language model. We can now use it to fine-tune a classifier using the IMDb sentiment labels.
Text Generation
Before we move on to fine-tuning the classifier, let’s quickly try something different: using our model to generate random reviews. Since it’s trained to guess what the next word of the sentence is, we can use the model to write new reviews:
In [ ]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
for _ in range(N_SENTENCES)]
In [ ]:
print("\n".join(preds))
i liked this movie because of its story and characters . The story line was very strong , very good for a sci - fi film . The main character , Alucard , was very well developed and brought the whole story
i liked this movie because i like the idea of the premise of the movie , the ( very ) convenient virus ( which , when you have to kill a few people , the " evil " machine has to be used to protect
As you can see, we add some randomness (we pick a random word based on the probabilities returned by the model) so we don’t get exactly the same review twice. Our model doesn’t have any programmed knowledge of the structure of a sentence or grammar rules, yet it has clearly learned a lot about English sentences: we can see it capitalizes properly (I is just transformed to i because our rules require two characters or more to consider a word as capitalized, so it’s normal to see it lowercased) and is using consistent tense. The general review makes sense at first glance, and it’s only if you read carefully that you can notice something is a bit off. Not bad for a model trained in a couple of hours!
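The temperature argument controls how much of that randomness is added. Here is a rough sketch of the idea (illustrative only, not fastai’s internal sampling code): lower temperatures sharpen the predicted distribution toward the most likely word, higher temperatures flatten it, and the next token is then drawn at random from the rescaled distribution.
In [ ]:
import torch

def sample_next_token(probs, temperature=0.75):
    "Sketch: rescale the model's probabilities, then draw one token at random"
    logits = torch.log(probs) / temperature      # temperature < 1 sharpens
    rescaled = torch.softmax(logits, dim=-1)
    return torch.multinomial(rescaled, num_samples=1)

# Toy distribution over a 4-token vocab
probs = torch.tensor([0.5, 0.3, 0.15, 0.05])
print(sample_next_token(probs))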
But our end goal wasn’t to train a model to generate reviews, but to classify them… so let’s use this model to do just that.
Creating the Classifier DataLoaders
We’re now moving from language model fine-tuning to classifier fine-tuning. To recap, a language model predicts the next word of a document, so it doesn’t need any external labels. A classifier, however, predicts some external label—in the case of IMDb, it’s the sentiment of a document.
This means that the structure of our `DataBlock` for NLP classification will look very familiar. It’s actually nearly the same as we’ve seen for the many image classification datasets we’ve worked with:
In [ ]:
dls_clas = DataBlock(
blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
get_y = parent_label,
get_items=partial(get_text_files, folders=['train', 'test']),
splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
Just like with image classification, `show_batch` shows the dependent variable (sentiment, in this case) with each independent variable (movie review text):
In [ ]:
dls_clas.show_batch(max_n=3)
| | text | category |
|---|---|---|
0 | xxbos i rate this movie with 3 skulls , only coz the girls knew how to scream , this could ‘ve been a better movie , if actors were better , the twins were xxup ok , i believed they were evil , but the eldest and youngest brother , they sucked really bad , it seemed like they were reading the scripts instead of acting them … . spoiler : if they ‘re vampire ‘s why do they freeze the blood ? vampires ca n’t drink frozen blood , the sister in the movie says let ‘s drink her while she is alive … .but then when they ‘re moving to another house , they take on a cooler they ‘re frozen blood . end of spoiler \n\n it was a huge waste of time , and that made me mad coz i read all the reviews of how | neg |
1 | xxbos i have read all of the xxmaj love xxmaj come xxmaj softly books . xxmaj knowing full well that movies can not use all aspects of the book , but generally they at least have the main point of the book . i was highly disappointed in this movie . xxmaj the only thing that they have in this movie that is in the book is that xxmaj missy ‘s father comes to xxunk in the book both parents come ) . xxmaj that is all . xxmaj the story line was so twisted and far fetch and yes , sad , from the book , that i just could n’t enjoy it . xxmaj even if i did n’t read the book it was too sad . i do know that xxmaj pioneer life was rough , but the whole movie was a downer . xxmaj the rating | neg |
2 | xxbos xxmaj this , for lack of a better term , movie is lousy . xxmaj where do i start … … \n\n xxmaj cinemaphotography - xxmaj this was , perhaps , the worst xxmaj i ‘ve seen this year . xxmaj it looked like the camera was being tossed from camera man to camera man . xxmaj maybe they only had one camera . xxmaj it gives you the sensation of being a volleyball . \n\n xxmaj there are a bunch of scenes , haphazardly , thrown in with no continuity at all . xxmaj when they did the ‘ split screen ‘ , it was absurd . xxmaj everything was squished flat , it looked ridiculous . \n\n xxmaj the color tones were way off . xxmaj these people need to learn how to balance a camera . xxmaj this ‘ movie ‘ is poorly made , and | neg |
Looking at the `DataBlock` definition, every piece is familiar from previous data blocks we’ve built, with two important exceptions:

- `TextBlock.from_folder` no longer has the `is_lm=True` parameter.
- We pass the `vocab` we created for the language model fine-tuning.

The reason that we pass the `vocab` of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won’t make any sense to this model, and the fine-tuning step won’t be of any use.

By passing `is_lm=False` (or not passing `is_lm` at all, since it defaults to `False`) we tell `TextBlock` that we have regular labeled data, rather than using the next tokens as labels. There is one challenge we have to deal with, however, which is collating multiple documents into a mini-batch. Let’s see with an example, by trying to create a mini-batch containing the first 10 documents. First we’ll numericalize them:
In [ ]:
nums_samp = toks200[:10].map(num)
Let’s now look at how many tokens each of these 10 movie reviews have:
In [ ]:
nums_samp.map(len)
Out[ ]:
(#10) [228,238,121,290,196,194,533,124,581,155]
Remember, PyTorch `DataLoader`s need to collate all the items in a batch into a single tensor, and a single tensor has a fixed shape (i.e., it has some particular length on every axis, and all items must be consistent). This should sound familiar: we had the same issue with images. In that case, we used cropping, padding, and/or squishing to make all the inputs the same size. Cropping might not be a good idea for documents, because it seems likely we’d remove some key information (having said that, the same issue is true for images, and we use cropping there; data augmentation hasn’t been well explored for NLP yet, so perhaps there are actually opportunities to use cropping in NLP too!). You can’t really “squish” a document. So that leaves padding!
We will expand the shortest texts to make them all the same size. To do this, we use a special padding token that will be ignored by our model. Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same length (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend to be of similar lengths. We won’t pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size. (It is possible to do something similar with images, which is especially useful for irregularly sized rectangular images, but at the time of writing no library provides good support for this yet, and there aren’t any papers covering it. It’s something we’re planning to add to fastai soon, however, so keep an eye on the book’s website; we’ll add information about this as soon as we have it working well.)
The sorting and padding are automatically done by the data block API for us when using a `TextBlock` with `is_lm=False`. (We don’t have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)
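As a rough illustration of what that padding step produces (a sketch with made-up token indices, not fastai’s actual collation code):
In [ ]:
import torch
from torch.nn.utils.rnn import pad_sequence

# Made-up numericalized documents of different lengths
docs = [torch.tensor([2, 5, 9, 4]),
        torch.tensor([3, 7]),
        torch.tensor([8, 1, 6])]

# Pad only to the length of the longest document in *this* batch,
# using a dedicated padding index that the model will ignore
batch = pad_sequence(docs, batch_first=True, padding_value=1)
print(batch.shape)   # torch.Size([3, 4])
print(batch)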
We can now create a model to classify our texts:
In [ ]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
metrics=accuracy).to_fp16()
The final step prior to training the classifier is to load the encoder from our fine-tuned language model. We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder; `load` by default raises an exception if an incomplete model is loaded:
In [ ]:
learn = learn.load_encoder('finetuned')
Fine-Tuning the Classifier
The last step is to train with discriminative learning rates and gradual unfreezing. In computer vision we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference:
In [ ]:
learn.fit_one_cycle(1, 2e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.347427 | 0.184480 | 0.929320 | 00:33 |
In just one epoch we get the same result as our training in <>: not too bad! We can pass `-2` to `freeze_to` to freeze all except the last two parameter groups:
In [ ]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.247763 | 0.171683 | 0.934640 | 00:37 |
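The `slice` passed to `fit_one_cycle` is how discriminative learning rates are expressed here: the earliest parameter group trains at the lower value, the last group at the higher one, and the groups in between are spaced between those two values (the `2.6**4` divisor is a heuristic from the ULMFiT work). A quick look at the numbers, purely as illustrative arithmetic:
In [ ]:
# Illustrative arithmetic only: the range of learning rates implied above
lr_max = 1e-2
lr_min = lr_max / (2.6**4)
print(f"{lr_min:.2e} (earliest layers) .. {lr_max:.2e} (last layers)")
# ~2.19e-04 .. 1.00e-02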
Then we can unfreeze a bit more, and continue training:
In [ ]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.193377 | 0.156696 | 0.941200 | 00:45 |
And finally, the whole model!
In [ ]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.172888 | 0.153770 | 0.943120 | 01:01 |
1 | 0.161492 | 0.155567 | 0.942640 | 00:57 |
We reached 94.3% accuracy, which was state-of-the-art performance just three years ago. By training another model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, by fine-tuning a much bigger model and using expensive data augmentation techniques (translating sentences into another language and back, using another model for translation).
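A sketch of how that forward/backward ensembling could look (the backward learner shown here is hypothetical; it would need to be trained separately on the same reviews with the token order reversed):
In [ ]:
# Hypothetical: learn is the forward classifier above, learn_bwd a second
# classifier trained on reversed texts
preds_fwd, targs = learn.get_preds()
preds_bwd, _     = learn_bwd.get_preds()

avg_preds = (preds_fwd + preds_bwd) / 2   # average the predicted probabilities
print(accuracy(avg_preds, targs))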
Using a pretrained model let us build a fine-tuned language model that was pretty powerful, capable of either generating fake reviews or helping to classify them. This is exciting stuff, but it’s good to remember that this technology can also be used for malign purposes.