Model Interpretation
It’s very hard to interpret loss functions directly, because they are designed to be things computers can differentiate and optimize, not things that people can understand. That’s why we have metrics. These are not used in the optimization process, but just to help us poor humans understand what’s going on. In this case, our accuracy is looking pretty good already! So where are we making mistakes?
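To make that distinction concrete, here is a toy example (the logits and targets below are made up purely for illustration, and this cell is not part of the training pipeline above). It computes both a cross-entropy loss and an accuracy from the same predictions: the loss is a raw number that the optimizer can minimize but that is hard to judge by eye, while the accuracy reads off directly as a fraction of correct answers.
In [ ]:
import torch
import torch.nn.functional as F

# Three made-up predictions over four classes; the third one is wrong
logits  = torch.tensor([[2.0, 0.5, 0.1, 0.0],
                        [0.2, 1.8, 0.3, 0.1],
                        [0.1, 0.2, 0.4, 2.2]])
targets = torch.tensor([0, 1, 2])

loss = F.cross_entropy(logits, targets)                   # what the optimizer sees
acc  = (logits.argmax(dim=1) == targets).float().mean()   # what we look at

print(loss)  # a raw number that is hard to judge on its own
print(acc)   # 2 of 3 correct -> 0.6667, immediately meaningful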
We saw earlier that we can use a confusion matrix to see where our model is doing well, and where it's doing badly:
In [ ]:
#width 600
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)
Oh dear—in this case, a confusion matrix is very hard to read. We have 37 different breeds of pet, which means we have 37×37 entries in this giant matrix! Instead, we can use the most_confused method, which just shows us the cells of the confusion matrix with the most incorrect predictions (here, at least 5):
In [ ]:
interp.most_confused(min_val=5)
Out[ ]:
[('american_pit_bull_terrier', 'staffordshire_bull_terrier', 10),
('Ragdoll', 'Birman', 8),
('Siamese', 'Birman', 6),
('Bengal', 'Egyptian_Mau', 5),
('american_pit_bull_terrier', 'american_bulldog', 5)]
Since we are not pet breed experts, it is hard for us to know whether these category errors reflect actual difficulties in recognizing breeds. So again, we turn to Google. A little bit of Googling tells us that the most common category errors shown here are breed differences that even expert breeders sometimes disagree about, which gives us some comfort that we are on the right track.
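We can also check for ourselves rather than rely on Google alone: the same interpretation object can show us the images the model got most badly wrong, with the predicted and actual labels in each title. As a rough sketch (the arguments here are just illustrative choices):
In [ ]:
# Show the predictions with the highest loss, with predicted vs. actual labels
interp.plot_top_losses(5, nrows=1)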
We seem to have a good baseline. What can we do now to make it even better?