Data
Have a look at the source of untar_data to see how it works. We’ll use it to access the 160-pixel version of Imagenette for this chapter:
In [ ]:
path = untar_data(URLs.IMAGENETTE_160)
To access the image files, we can use get_image_files:
In [ ]:
t = get_image_files(path)
t[0]
Out[ ]:
Path('/home/jhoward/.fastai/data/imagenette2-160/val/n03417042/n03417042_3752.JPEG')
Or we could do the same thing using just Python’s standard library, with glob:
In [ ]:
from glob import glob
# recursive=True lets ** match files at any directory depth
files = L(glob(f'{path}/**/*.JPEG', recursive=True)).map(Path)
files[0]
Out[ ]:
Path('/home/jhoward/.fastai/data/imagenette2-160/val/n03417042/n03417042_3752.JPEG')
If you look at the source for get_image_files, you’ll see it uses Python’s os.walk; this is a faster and more flexible function than glob, so be sure to try it out.
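To get a feel for it, here’s a minimal sketch of the os.walk approach (walk_images is a hypothetical helper for illustration, not get_image_files’s actual implementation):

import os
from pathlib import Path

def walk_images(path, extensions=('.jpeg','.jpg','.png')):
    # collect image files under `path` with os.walk -- a sketch, not fastai's code
    res = []
    for root,_,fnames in os.walk(path):
        res += [Path(root)/f for f in fnames if f.lower().endswith(extensions)]
    return res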
We can open an image with the Python Imaging Library’s Image class:
In [ ]:
im = Image.open(files[0])
im
Out[ ]:
(the image is displayed)
In [ ]:
im_t = tensor(im)
im_t.shape
Out[ ]:
torch.Size([160, 213, 3])
That’s going to be the basis of our independent variable. For our dependent variable, we can use Path.parent from pathlib. First we’ll need our vocab:
In [ ]:
lbls = files.map(Self.parent.name()).unique(); lbls
Out[ ]:
(#10) ['n03417042','n03445777','n03888257','n03394916','n02979186','n03000684','n03425413','n01440764','n03028079','n02102040']
…and the reverse mapping, thanks to L.val2idx:
In [ ]:
v2i = lbls.val2idx(); v2i
Out[ ]:
{'n03417042': 0,
'n03445777': 1,
'n03888257': 2,
'n03394916': 3,
'n02979186': 4,
'n03000684': 5,
'n03425413': 6,
'n01440764': 7,
'n03028079': 8,
'n02102040': 9}
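val2idx maps each value to its index, so if you’re not using fastcore’s L, the same dictionary is a one-line comprehension:

# equivalent to lbls.val2idx()
v2i = {v:i for i,v in enumerate(lbls)}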
That’s all the pieces we need to put together our Dataset.
Dataset
A Dataset in PyTorch can be anything that supports indexing (__getitem__) and len:
In [ ]:
class Dataset:
    def __init__(self, fns): self.fns=fns
    def __len__(self): return len(self.fns)
    def __getitem__(self, i):
        # PIL's resize takes (width,height); convert('RGB') guards against
        # grayscale or palette images
        im = Image.open(self.fns[i]).resize((64,64)).convert('RGB')
        y = v2i[self.fns[i].parent.name]
        # scale pixels to 0..1 and return an (independent,dependent) tuple
        return tensor(im).float()/255, tensor(y)
We need a list of training and validation filenames to pass to Dataset.__init__:
In [ ]:
# an image is in the training set if its grandparent directory is named 'train'
train_filt = L(o.parent.parent.name=='train' for o in files)
train,valid = files[train_filt],files[~train_filt]
len(train),len(valid)
Out[ ]:
(9469, 3925)
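This works because fastcore’s L accepts a boolean mask as an index, and ~ inverts such a mask elementwise; a quick illustration (assuming the same fastcore behavior the book’s code relies on):

mask = L(o>1 for o in range(4))        # (#4) [False,False,True,True]
L.range(4)[mask], L.range(4)[~mask]    # ((#2) [2,3], (#2) [0,1])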
Now we can try it out:
In [ ]:
train_ds,valid_ds = Dataset(train),Dataset(valid)
x,y = train_ds[0]
x.shape,y
Out[ ]:
(torch.Size([64, 64, 3]), tensor(0))
In [ ]:
show_image(x, title=lbls[y]);
As you see, our dataset is returning the independent and dependent variables as a tuple, which is just what we need. We’ll need to be able to collate these into a mini-batch. Generally this is done with torch.stack, which is what we’ll use here:
In [ ]:
def collate(idxs, ds):
    # fetch each (x,y) pair, then stack the xs and ys into batch tensors
    xb,yb = zip(*[ds[i] for i in idxs])
    return torch.stack(xb),torch.stack(yb)
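torch.stack joins tensors along a new leading dimension, which becomes the batch axis:

a,b = torch.zeros(64,64,3),torch.ones(64,64,3)
torch.stack([a,b]).shape    # torch.Size([2, 64, 64, 3])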
Here’s a mini-batch with two items, for testing our collate function:
In [ ]:
x,y = collate([1,2], train_ds)
x.shape,y
Out[ ]:
(torch.Size([2, 64, 64, 3]), tensor([0, 0]))
Now that we have a dataset and a collation function, we’re ready to create DataLoader. We’ll add two more things here: an optional shuffle for the training set, and a ProcessPoolExecutor to do our preprocessing in parallel. A parallel data loader is very important, because opening and decoding a JPEG image is a slow process. One CPU core is not enough to decode images fast enough to keep a modern GPU busy. Here’s our DataLoader class:
In [ ]:
class DataLoader:
    def __init__(self, ds, bs=128, shuffle=False, n_workers=1):
        self.ds,self.bs,self.shuffle,self.n_workers = ds,bs,shuffle,n_workers
    def __len__(self): return (len(self.ds)-1)//self.bs+1
    def __iter__(self):
        idxs = L.range(self.ds)
        if self.shuffle: idxs = idxs.shuffle()
        # split the (possibly shuffled) indices into batch-sized chunks
        chunks = [idxs[n:n+self.bs] for n in range(0, len(self.ds), self.bs)]
        with ProcessPoolExecutor(self.n_workers) as ex:
            yield from ex.map(collate, chunks, ds=self.ds)
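Note that passing ds=self.ds through ex.map relies on fastcore’s ProcessPoolExecutor, which forwards extra keyword arguments to the mapped function. With the standard library’s executor you’d bind the argument yourself, roughly like this sketch (iter_batches is a hypothetical helper, not part of the book’s code):

from concurrent.futures import ProcessPoolExecutor
from functools import partial

def iter_batches(ds, chunks, n_workers):
    # the standard library's map doesn't forward kwargs, so bind `ds` up front
    with ProcessPoolExecutor(n_workers) as ex:
        yield from ex.map(partial(collate, ds=ds), chunks)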
Let’s try it out with our training and validation datasets:
In [ ]:
n_workers = min(16, defaults.cpus)
train_dl = DataLoader(train_ds, bs=128, shuffle=True, n_workers=n_workers)
valid_dl = DataLoader(valid_ds, bs=256, shuffle=False, n_workers=n_workers)
xb,yb = first(train_dl)
xb.shape,yb.shape,len(train_dl)
Out[ ]:
(torch.Size([128, 64, 64, 3]), torch.Size([128]), 74)
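The batch count matches the formula in __len__: (9469-1)//128+1 = 74, that is, 73 full batches of 128 items plus one final batch of 125.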
This data loader is not much slower than PyTorch’s, but it’s far simpler. So if you’re debugging a complex data loading process, don’t be afraid to try doing things manually to help you see exactly what’s going on.
For normalization, we’ll need image statistics. Generally it’s fine to calculate these on a single training mini-batch, since high precision isn’t needed here. Averaging over the batch, height, and width axes (dimensions 0, 1, and 2 of our NHWC batch) leaves one mean and one standard deviation per channel:
In [ ]:
stats = [xb.mean((0,1,2)),xb.std((0,1,2))]
stats
Out[ ]:
[tensor([0.4544, 0.4453, 0.4141]), tensor([0.2812, 0.2766, 0.2981])]
Our Normalize class just needs to store these stats and apply them (to see why the to_device is needed, try commenting it out, and see what happens later in this notebook):
In [ ]:
class Normalize:
    def __init__(self, stats): self.stats=stats
    def __call__(self, x):
        # move the stats to whatever device the batch is on (e.g., the GPU)
        if x.device != self.stats[0].device:
            self.stats = to_device(self.stats, x.device)
        return (x-self.stats[0])/self.stats[1]
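The arithmetic in __call__ relies on broadcasting: each of the two stats is a 3-element tensor, which broadcasts against the trailing channel axis of our NHWC batches:

xb = torch.rand(2,64,64,3)
mean,std = xb.mean((0,1,2)),xb.std((0,1,2))    # both have shape (3,)
((xb-mean)/std).shape                          # torch.Size([2, 64, 64, 3])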
We always like to test everything we build in a notebook, as soon as we build it:
In [ ]:
norm = Normalize(stats)
# normalize, then reorder axes from NHWC to NCHW for PyTorch
def tfm_x(x): return norm(x).permute((0,3,1,2))
In [ ]:
t = tfm_x(x)
t.mean((0,2,3)),t.std((0,2,3))
Out[ ]:
(tensor([0.3732, 0.4907, 0.5633]), tensor([1.0212, 1.0311, 1.0131]))
Here tfm_x isn’t just applying Normalize, but is also permuting the axis order from NHWC to NCHW (see <> if you need a reminder of what these acronyms refer to). PIL uses HWC axis order, which we can’t use with PyTorch, hence the need for this permute.
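As a concrete example of that axis shuffle:

x = torch.zeros(128,64,64,3)       # NHWC, as our Dataset produces
x.permute((0,3,1,2)).shape         # torch.Size([128, 3, 64, 64]), i.e. NCHW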
That’s all we need for the data for our model. So now we need the model itself!