Multilayer RNNs
In a multilayer RNN, we pass the activations from our recurrent neural network into a second recurrent neural network, like in <>.
The unrolled representation is shown in <> (similar to <>).
Let’s see how to implement this in practice.
The Model
We can save some time by using PyTorch's RNN class, which implements exactly what we created earlier, but also gives us the option to stack multiple RNNs, as we have discussed:
In [ ]:
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        # n_layers stacked RNNs; batch_first=True means inputs are (bs, seq_len, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        # one hidden state per layer, carried across batches
        self.h = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()   # keep the values, but drop the gradient history
        return self.h_o(res)

    def reset(self): self.h.zero_()
In [ ]:
learn = Learner(dls, LMModel5(len(vocab), 64, 2),
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 3.055853 | 2.591640 | 0.437907 | 00:01 |
1 | 2.162359 | 1.787310 | 0.471598 | 00:01 |
2 | 1.710663 | 1.941807 | 0.321777 | 00:01 |
3 | 1.520783 | 1.999726 | 0.312012 | 00:01 |
4 | 1.330846 | 2.012902 | 0.413249 | 00:01 |
5 | 1.163297 | 1.896192 | 0.450684 | 00:01 |
6 | 1.033813 | 2.005209 | 0.434814 | 00:01 |
7 | 0.919090 | 2.047083 | 0.456706 | 00:01 |
8 | 0.822939 | 2.068031 | 0.468831 | 00:01 |
9 | 0.750180 | 2.136064 | 0.475098 | 00:01 |
10 | 0.695120 | 2.139140 | 0.485433 | 00:01 |
11 | 0.655752 | 2.155081 | 0.493652 | 00:01 |
12 | 0.629650 | 2.162583 | 0.498535 | 00:01 |
13 | 0.613583 | 2.171649 | 0.491048 | 00:01 |
14 | 0.604309 | 2.180355 | 0.487874 | 00:01 |
Now that’s disappointing… our previous single-layer RNN performed better. Why? The reason is that we have a deeper model, leading to exploding or vanishing activations.
Exploding or Disappearing Activations
In practice, creating accurate models from this kind of RNN is difficult. We will get better results if we call detach less often and have more layers; this gives our RNN a longer time horizon to learn from, and richer features to create. But it also means we have a deeper model to train. The key challenge in the development of deep learning has been figuring out how to train these kinds of models.
The reason this is challenging is because of what happens when you multiply by a matrix many times. Think about what happens when you multiply by a number many times. For example, if you multiply by 2, starting at 1, you get the sequence 1, 2, 4, 8,… after 32 steps you are already at 4,294,967,296. A similar issue happens if you multiply by 0.5: you get 0.5, 0.25, 0.125… and after 32 steps it’s 0.00000000023. As you can see, multiplying by a number even slightly higher or lower than 1 results in an explosion or disappearance of our starting number, after just a few repeated multiplications.
Because matrix multiplication is just multiplying numbers and adding them up, exactly the same thing happens with repeated matrix multiplications. And that’s all a deep neural network is —each extra layer is another matrix multiplication. This means that it is very easy for a deep neural network to end up with extremely large or extremely small numbers.
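To see this concretely, here is a toy sketch (not part of the model above) that repeats a matrix multiplication 32 times. The matrix is just a diagonal scaling matrix, so the effect mirrors the 2 and 0.5 example exactly:
In [ ]:
import torch

x = torch.ones(100)
for scale in (0.5, 2.0):
    w = torch.eye(100) * scale        # a matrix that simply scales every element by `scale`
    a = x.clone()
    for _ in range(32): a = w @ a     # 32 "layers" of the same matrix multiplication
    print(scale, a[0].item())         # 0.5 -> ~2.3e-10, 2.0 -> 4294967296.0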
This is a problem, because the way computers store numbers (known as “floating point”) means that they become less and less accurate the further away the numbers get from zero. The diagram in <>, from the excellent article “What You Never Wanted to Know About Floating Point but Will Be Forced to Find Out”, shows how the precision of floating-point numbers varies over the number line.
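As a quick illustration (not from that article): in 32-bit floating point, the gap between adjacent representable numbers grows with magnitude, so adding 1 to a large enough number has no effect at all, while sufficiently small numbers round to exactly zero:
In [ ]:
import torch

x = torch.tensor(2.0**24)    # 16,777,216: beyond this, float32 can't represent every integer
print(x + 1 == x)            # tensor(True): the +1 is lost to rounding
print(torch.tensor(1e-46))   # tensor(0.): too close to zero to represent, so it underflows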
This inaccuracy means that often the gradients calculated for updating the weights end up as zero or infinity for deep networks. This is commonly referred to as the vanishing gradients or exploding gradients problem. It means that in SGD, the weights are either not updated at all or jump to infinity. Either way, they won’t improve with training.
Researchers have developed a number of ways to tackle this problem, which we will be discussing later in the book. One option is to change the definition of a layer in a way that makes it less likely to have exploding activations. We’ll look at the details of how this is done in <>, when we discuss batch normalization, and <>, when we discuss ResNets, although these details don’t generally matter in practice (unless you are a researcher that is creating new approaches to solving this problem). Another strategy for dealing with this is by being careful about initialization, which is a topic we’ll investigate in <>.
For RNNs, there are two types of layers that are frequently used to avoid exploding activations: gated recurrent units (GRUs) and long short-term memory (LSTM) layers. Both of these are available in PyTorch, and are drop-in replacements for the RNN layer. We will only cover LSTMs in this book; there are plenty of good tutorials online explaining GRUs, which are a minor variant on the LSTM design.
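Here is a minimal standalone sketch of the LSTM interface (illustrative sizes, not tied to the model above): nn.LSTM takes the same constructor arguments we passed to nn.RNN, and the main practical difference is that its hidden state is a tuple of two tensors, the hidden state and the cell state:
In [ ]:
import torch
import torch.nn as nn

n_hidden, n_layers, bs, seq_len = 64, 2, 16, 3    # illustrative sizes
lstm = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
x  = torch.randn(bs, seq_len, n_hidden)           # stands in for the embedded input
h0 = torch.zeros(n_layers, bs, n_hidden)          # hidden state, one per layer
c0 = torch.zeros(n_layers, bs, n_hidden)          # cell state, one per layer
res, (hn, cn) = lstm(x, (h0, c0))
print(res.shape, hn.shape, cn.shape)              # (bs, seq_len, n_hidden) and two (n_layers, bs, n_hidden)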