Mixup
Mixup, introduced in the 2017 paper “mixup: Beyond Empirical Risk Minimization” by Hongyi Zhang et al., is a very powerful data augmentation technique that can provide dramatically higher accuracy, especially when you don’t have much data and don’t have a pretrained model that was trained on data similar to your dataset. The paper explains: “While data augmentation consistently leads to improved generalization, the procedure is dataset-dependent, and thus requires the use of expert knowledge.” For instance, it’s common to flip images as part of data augmentation, but should you flip only horizontally, or also vertically? The answer is that it depends on your dataset. In addition, if flipping (for instance) doesn’t provide enough data augmentation for you, you can’t “flip more.” It’s helpful to have data augmentation techniques where you can “dial up” or “dial down” the amount of change, to see what works best for you.
Mixup works as follows, for each image:
1. Select another image from your dataset at random.
2. Pick a weight at random.
3. Take a weighted average (using the weight from step 2) of the selected image with your image; this will be your independent variable.
4. Take a weighted average (with the same weight) of the selected image’s labels with your image’s labels; this will be your dependent variable.
In pseudocode, we’re doing this (where t is the weight for our weighted average):
image2,target2 = dataset[randint(0,len(dataset))]
t = random_float(0.5,1.0)
new_image = t * image1 + (1-t) * image2
new_target = t * target1 + (1-t) * target2
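The pseudocode above can be turned into a small runnable sketch with NumPy. This is only an illustration under assumptions: the dataset is taken to be a sequence of (image, one-hot target) pairs with equally sized images, and the names `one_hot`, `mixup_sample`, and `toy_dataset` are hypothetical helpers, not part of any library.

```python
import numpy as np

def one_hot(index, n_classes):
    """Return a float vector with a 1 at `index` and 0 elsewhere."""
    v = np.zeros(n_classes, dtype=np.float32)
    v[index] = 1.0
    return v

def mixup_sample(dataset, i, rng):
    """Mix item `i` of `dataset` with another item chosen at random.

    `dataset` is assumed to be a sequence of (image, one_hot_target) pairs
    whose images all share one shape.
    """
    image1, target1 = dataset[i]
    # rng.integers excludes the upper bound, so the index is always valid
    image2, target2 = dataset[rng.integers(0, len(dataset))]
    # Same range as the pseudocode: the original image keeps at least half the weight
    t = rng.uniform(0.5, 1.0)
    new_image = t * image1 + (1 - t) * image2
    new_target = t * target1 + (1 - t) * target2
    return new_image, new_target, t

# Toy data: four random 8x8 RGB "images" with labels from 3 classes (made up for illustration)
rng = np.random.default_rng(0)
toy_dataset = [(rng.random((8, 8, 3), dtype=np.float32), one_hot(label, 3))
               for label in [0, 1, 2, 1]]

new_image, new_target, t = mixup_sample(toy_dataset, 0, rng)
print(new_image.shape, new_target, t)
```

Because both weights sum to 1, the mixed target still sums to 1, so it can be read as a probability distribution over classes.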
For this to work, our targets need to be one-hot encoded. The paper describes this using the equations shown in <> where $\lambda$ is the same as t in our pseudocode:
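For reference, the paper's formulation can be written as follows, where $x_i$ and $x_j$ are two input examples, $y_i$ and $y_j$ are their one-hot-encoded labels, and $\lambda$ is drawn from a $\mathrm{Beta}(\alpha, \alpha)$ distribution:

$$
\begin{aligned}
\tilde{x} &= \lambda x_i + (1 - \lambda)\, x_j \\
\tilde{y} &= \lambda y_i + (1 - \lambda)\, y_j
\end{aligned}
$$

In terms of the pseudocode, $x_i$ and $y_i$ correspond to image1 and target1, and $x_j$ and $y_j$ correspond to image2 and target2.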
Sidebar: Papers and Math
We’re going to be looking at more and more research papers from here on in the book. Now that you have the basic jargon, you might be surprised to discover how much of them you can understand, with a little practice! One issue you’ll notice is that Greek letters, such as $\lambda$, appear in most papers. It’s a very good idea to learn the names of all the Greek letters, since otherwise it’s very hard to read the papers to yourself, and remember them (or to read code based on them, since code often uses the names of the Greek letters spelled out, such as lambda).
The bigger issue with papers is that they use math, instead of code, to explain what’s going on. If you don’t have much of a math background, this will likely be intimidating and confusing at first. But remember: what is being shown in the math is something that will be implemented in code. It’s just another way of talking about the same thing! After reading a few papers, you’ll pick up more and more of the notation. If you don’t know what a symbol is, try looking it up in Wikipedia’s list of mathematical symbols or drawing it in Detexify, which (using machine learning!) will find the name of your hand-drawn symbol. Then you can search online for that name to find out what it’s for.
End sidebar
<> shows what it looks like when we take a linear combination of images, as done in Mixup.
#hide_input
#id mixup_example
#caption Mixing a church and a gas station
#alt An image of a church, a gas station and the two mixed up.
church = PILImage.create(get_image_files_sorted(path/'train'/'n03028079')[0])
gas = PILImage.create(get_image_files_sorted(path/'train'/'n03425413')[0])
church = church.resize((256,256))
gas = gas.resize((256,256))
tchurch = tensor(church).float() / 255.
tgas = tensor(gas).float() / 255.
_,axs = plt.subplots(1, 3, figsize=(12,4))
show_image(tchurch, ax=axs[0]);
show_image(tgas, ax=axs[1]);
show_image((0.3*tchurch + 0.7*tgas), ax=axs[2]);
The third image is built by adding 0.3 times the first one and 0.7 times the second. In this example, should the model predict “church” or “gas station”? The right answer is 30% church and 70% gas station, since that’s what we’ll get if we take the linear combination of the one-hot-encoded targets. For instance, suppose we have 10 classes, “church” is represented by the index 2, and “gas station” is represented by the index 7. The one-hot-encoded representations are then:
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
so our final target is:
[0, 0, 0.3, 0, 0, 0, 0, 0.7, 0, 0]
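The arithmetic above can be checked with a short sketch in plain Python. The second half illustrates one consequence, assuming a cross-entropy loss: scoring a prediction against the mixed target gives the same value as the 0.3/0.7 weighted average of the losses against each original label. The logits are made-up values, and the helper names are hypothetical.

```python
import math

N_CLASSES = 10
CHURCH, GAS = 2, 7  # class indices from the example above

def one_hot(index, n_classes):
    """Return a list with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]

# Mix the two targets with the same weights used for the images
target = [0.3 * c + 0.7 * g
          for c, g in zip(one_hot(CHURCH, N_CLASSES), one_hot(GAS, N_CLASSES))]
print(target)

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Hypothetical model outputs for the mixed image (values made up for illustration)
logits = [0.1, -0.3, 1.2, 0.0, -1.0, 0.4, 0.2, 2.0, -0.5, 0.3]
probs = softmax(logits)

loss_soft = cross_entropy(probs, target)
loss_weighted = (0.3 * cross_entropy(probs, one_hot(CHURCH, N_CLASSES))
                 + 0.7 * cross_entropy(probs, one_hot(GAS, N_CLASSES)))
print(loss_soft, loss_weighted)
```

The two printed losses agree up to floating-point rounding, which is why a mixed target can also be handled by mixing two ordinary losses rather than building the soft label explicitly.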
This is all done for us inside fastai by adding a callback to our Learner. Callbacks are what is used inside fastai to inject custom behavior in the training loop (like a learning rate schedule, or training in mixed precision). We’ll be learning all about callbacks, including how to make your own, in <>. For now, all you need to know is that you use the cbs parameter to Learner to pass callbacks.
Here is how we train a model with Mixup:
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(5, 3e-3)
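To give a feel for what a Mixup step does to each batch during training, here is a minimal NumPy sketch. It is an illustration under assumptions, not fastai's implementation: the function name `mixup_batch` is hypothetical, targets are assumed to be one-hot encoded, and the weight is drawn from a Beta(alpha, alpha) distribution as in the paper, then folded so that the original item keeps at least half the weight (matching the 0.5 to 1.0 range in the pseudocode).

```python
import numpy as np

def mixup_batch(images, targets, alpha, rng):
    """Mix each item in a batch with a randomly chosen partner from the same batch.

    images:  float array of shape (batch, ...)
    targets: one-hot float array of shape (batch, n_classes)
    alpha:   Beta(alpha, alpha) parameter controlling how strongly items are mixed
    """
    batch_size = images.shape[0]
    lam = rng.beta(alpha, alpha, size=batch_size)
    # Keep the larger weight on the original item
    lam = np.maximum(lam, 1 - lam)
    # A random permutation pairs every item with a partner from the batch
    partner = rng.permutation(batch_size)
    # Reshape lam so it broadcasts over the non-batch image dimensions
    lam_img = lam.reshape(batch_size, *([1] * (images.ndim - 1)))
    mixed_images = lam_img * images + (1 - lam_img) * images[partner]
    mixed_targets = lam[:, None] * targets + (1 - lam[:, None]) * targets[partner]
    return mixed_images, mixed_targets, lam

# Toy batch: 4 random "images" of shape (3, 8, 8) with 3 classes (made up for illustration)
rng = np.random.default_rng(42)
images = rng.random((4, 3, 8, 8))
labels = np.array([0, 1, 2, 1])
targets = np.eye(3)[labels]

mixed_images, mixed_targets, lam = mixup_batch(images, targets, alpha=0.4, rng=rng)
print(lam)
print(mixed_targets)
```

In this sketch, smaller values of `alpha` push most weights toward 1 (little mixing), while larger values push them toward 0.5 (heavy mixing), which is one way to dial the amount of augmentation up or down.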
What happens when we train a model with data that’s “mixed up” in this way? Clearly, it’s going to be harder to train, because it’s harder to see what’s in each image. And the model has to predict two labels per image, rather than just one, as well as figuring out how much each one is weighted. Overfitting seems less likely to be a problem, however, because we’re not showing the same image in each epoch, but are instead showing a random combination of two images.
Mixup requires far more epochs to train to get better accuracy, compared to other augmentation approaches we’ve seen. You can try training Imagenette with and without Mixup by using the examples/train_imagenette.py script in the fastai repo. At the time of writing, the leaderboard in the Imagenette repo is showing that Mixup is used for all leading results for trainings of >80 epochs, and for fewer epochs Mixup is not being used. This is in line with our experience of using Mixup too.
One of the reasons that Mixup is so exciting is that it can be applied to types of data other than photos. In fact, some people have even shown good results by using Mixup on activations inside their models, not just on inputs—this allows Mixup to be used for NLP and other data types too.
There’s another subtle issue that Mixup deals with for us, which is that it’s not actually possible with the models we’ve seen before for our loss to ever be perfect. The problem is that our labels are 1s and 0s, but the outputs of softmax and sigmoid can never equal 1 or 0. This means training our model pushes our activations ever closer to those values, such that the more epochs we do, the more extreme our activations become.
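A toy calculation makes this concrete. The sketch below assumes a hypothetical three-class problem in which the correct class has logit `z` and the other classes have logit 0; the function name is illustrative only.

```python
import math

def loss_for_correct_class(z, n_classes=3):
    """Cross-entropy loss when the correct class has logit `z` and all others have logit 0.

    Softmax probability of the correct class: e^z / (e^z + n_classes - 1)
    Loss = -log(probability) = log(1 + (n_classes - 1) * e^-z)
    """
    return math.log1p((n_classes - 1) * math.exp(-z))

for z in [1, 2, 5, 10]:
    loss = loss_for_correct_class(z)
    prob = math.exp(-loss)  # recover the softmax probability from the loss
    print(f"logit={z:>2}  prob={prob:.6f}  loss={loss:.6f}")
```

The loss keeps shrinking as the logit grows but never reaches 0, so training with hard 0/1 labels keeps rewarding the model for making its activations more extreme.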
With Mixup we no longer have that problem, because our labels will only be exactly 1 or 0 if we happen to “mix” with another image of the same class. The rest of the time our labels will be a linear combination, such as the 0.7 and 0.3 we got in the church and gas station example earlier.
One issue with this, however, is that Mixup is “accidentally” making the labels bigger than 0, or smaller than 1. That is to say, we’re not explicitly telling our model that we want to change the labels in this way. So, if we want to make the labels closer to, or further away from, 0 and 1, we have to change the amount of Mixup, and that also changes the amount of data augmentation, which might not be what we want. There is, however, a way to handle this more directly, which is to use label smoothing.