Data
Have a look at the source of untar_data to see how it works. We’ll use it to access the 160-pixel version of Imagenette for this chapter:
In [ ]:
path = untar_data(URLs.IMAGENETTE_160)
To access the image files, we can use get_image_files:
In [ ]:
t = get_image_files(path)
t[0]
Out[ ]:
Path('/home/jhoward/.fastai/data/imagenette2-160/val/n03417042/n03417042_3752.JPEG')
Or we could do the same thing using just Python’s standard library, with glob:
In [ ]:
from glob import glob
# recursive=True lets ** match files at any directory depth
files = L(glob(f'{path}/**/*.JPEG', recursive=True)).map(Path)
files[0]
Out[ ]:
Path('/home/jhoward/.fastai/data/imagenette2-160/val/n03417042/n03417042_3752.JPEG')
If you look at the source for get_image_files, you’ll see it uses Python’s os.walk; this is a faster and more flexible function than glob, so be sure to try it out.
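To get a feel for it, here’s a minimal sketch of the os.walk approach (walk_images is a hypothetical helper for illustration, not get_image_files’s actual implementation):

import os
from pathlib import Path

def walk_images(path, extensions=('.jpeg','.jpg','.png')):
    # collect image files under `path` with os.walk -- a sketch, not fastai's code
    res = []
    for root,_,fnames in os.walk(path):
        res += [Path(root)/f for f in fnames if f.lower().endswith(extensions)]
    return res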
We can open an image with the Python Imaging Library’s Image class:
In [ ]:
im = Image.open(files[0])
im
Out[ ]:
(the image is displayed)
In [ ]:
im_t = tensor(im)
im_t.shape
Out[ ]:
torch.Size([160, 213, 3])
That’s going to be the basis of our independent variable. For our dependent variable, we can use Path.parent from pathlib. First we’ll need our vocab:
In [ ]:
lbls = files.map(Self.parent.name()).unique(); lbls
Out[ ]:
(#10) ['n03417042','n03445777','n03888257','n03394916','n02979186','n03000684','n03425413','n01440764','n03028079','n02102040']
…and the reverse mapping, thanks to L.val2idx:
In [ ]:
v2i = lbls.val2idx(); v2i
Out[ ]:
{'n03417042': 0,
'n03445777': 1,
'n03888257': 2,
'n03394916': 3,
'n02979186': 4,
'n03000684': 5,
'n03425413': 6,
'n01440764': 7,
'n03028079': 8,
'n02102040': 9}
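val2idx maps each value to its index, so if you’re not using fastcore’s L, the same dictionary is a one-line comprehension:

# equivalent to lbls.val2idx()
v2i = {v:i for i,v in enumerate(lbls)}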
That’s all the pieces we need to put together our Dataset.
Dataset
A Dataset in PyTorch can be anything that supports indexing (__getitem__) and len:
In [ ]:
class Dataset:
    def __init__(self, fns): self.fns=fns
    def __len__(self): return len(self.fns)
    def __getitem__(self, i):
        # PIL's resize takes (width,height); convert('RGB') guards against
        # grayscale or palette images
        im = Image.open(self.fns[i]).resize((64,64)).convert('RGB')
        y = v2i[self.fns[i].parent.name]
        # scale pixels to 0..1 and return an (independent,dependent) tuple
        return tensor(im).float()/255, tensor(y)
We need a list of training and validation filenames to pass to Dataset.__init__:
In [ ]:
# an image is in the training set if its grandparent directory is named 'train'
train_filt = L(o.parent.parent.name=='train' for o in files)
train,valid = files[train_filt],files[~train_filt]
len(train),len(valid)
Out[ ]:
(9469, 3925)
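This works because fastcore’s L accepts a boolean mask as an index, and ~ inverts such a mask elementwise; a quick illustration (assuming the same fastcore behavior the book’s code relies on):

mask = L(o>1 for o in range(4))        # (#4) [False,False,True,True]
L.range(4)[mask], L.range(4)[~mask]    # ((#2) [2,3], (#2) [0,1])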
Now we can try it out:
In [ ]:
train_ds,valid_ds = Dataset(train),Dataset(valid)
x,y = train_ds[0]
x.shape,y
Out[ ]:
(torch.Size([64, 64, 3]), tensor(0))
In [ ]:
show_image(x, title=lbls[y]);
As you see, our dataset is returning the independent and dependent variables as a tuple, which is just what we need. We’ll need to be able to collate these into a mini-batch. Generally this is done with torch.stack, which is what we’ll use here:
In [ ]:
def collate(idxs, ds):
    # fetch each (x,y) pair, then stack the xs and ys into batch tensors
    xb,yb = zip(*[ds[i] for i in idxs])
    return torch.stack(xb),torch.stack(yb)
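torch.stack joins tensors along a new leading dimension, which becomes the batch axis:

a,b = torch.zeros(64,64,3),torch.ones(64,64,3)
torch.stack([a,b]).shape    # torch.Size([2, 64, 64, 3])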
Here’s a mini-batch with two items, for testing our collate function:
In [ ]:
x,y = collate([1,2], train_ds)
x.shape,y
Out[ ]:
(torch.Size([2, 64, 64, 3]), tensor([0, 0]))
Now that we have a dataset and a collation function, we’re ready to create DataLoader. We’ll add two more things here: an optional shuffle for the training set, and a ProcessPoolExecutor to do our preprocessing in parallel. A parallel data loader is very important, because opening and decoding a JPEG image is a slow process. One CPU core is not enough to decode images fast enough to keep a modern GPU busy. Here’s our DataLoader class:
In [ ]:
class DataLoader:
    def __init__(self, ds, bs=128, shuffle=False, n_workers=1):
        self.ds,self.bs,self.shuffle,self.n_workers = ds,bs,shuffle,n_workers
    def __len__(self): return (len(self.ds)-1)//self.bs+1
    def __iter__(self):
        idxs = L.range(self.ds)
        if self.shuffle: idxs = idxs.shuffle()
        # split the (possibly shuffled) indices into batch-sized chunks
        chunks = [idxs[n:n+self.bs] for n in range(0, len(self.ds), self.bs)]
        with ProcessPoolExecutor(self.n_workers) as ex:
            yield from ex.map(collate, chunks, ds=self.ds)
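Note that passing ds=self.ds through ex.map relies on fastcore’s ProcessPoolExecutor, which forwards extra keyword arguments to the mapped function. With the standard library’s executor you’d bind the argument yourself, roughly like this sketch (iter_batches is a hypothetical helper, not part of the book’s code):

from concurrent.futures import ProcessPoolExecutor
from functools import partial

def iter_batches(ds, chunks, n_workers):
    # the standard library's map doesn't forward kwargs, so bind `ds` up front
    with ProcessPoolExecutor(n_workers) as ex:
        yield from ex.map(partial(collate, ds=ds), chunks)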
Let’s try it out with our training and validation datasets:
In [ ]:
n_workers = min(16, defaults.cpus)
train_dl = DataLoader(train_ds, bs=128, shuffle=True, n_workers=n_workers)
valid_dl = DataLoader(valid_ds, bs=256, shuffle=False, n_workers=n_workers)
xb,yb = first(train_dl)
xb.shape,yb.shape,len(train_dl)
Out[ ]:
(torch.Size([128, 64, 64, 3]), torch.Size([128]), 74)
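The batch count matches the formula in __len__: (9469-1)//128+1 = 74, that is, 73 full batches of 128 items plus one final batch of 125.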
This data loader is not much slower than PyTorch’s, but it’s far simpler. So if you’re debugging a complex data loading process, don’t be afraid to try doing things manually to help you see exactly what’s going on.
For normalization, we’ll need image statistics. Generally it’s fine to calculate these on a single training mini-batch, since high precision isn’t needed here. Averaging over the batch, height, and width axes (dimensions 0, 1, and 2 of our NHWC batch) leaves one mean and one standard deviation per channel:
In [ ]:
stats = [xb.mean((0,1,2)),xb.std((0,1,2))]
stats
Out[ ]:
[tensor([0.4544, 0.4453, 0.4141]), tensor([0.2812, 0.2766, 0.2981])]
Our Normalize class just needs to store these stats and apply them (to see why the to_device is needed, try commenting it out, and see what happens later in this notebook):
In [ ]:
class Normalize:
    def __init__(self, stats): self.stats=stats
    def __call__(self, x):
        # move the stats to whatever device the batch is on (e.g., the GPU)
        if x.device != self.stats[0].device:
            self.stats = to_device(self.stats, x.device)
        return (x-self.stats[0])/self.stats[1]
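The arithmetic in __call__ relies on broadcasting: each of the two stats is a 3-element tensor, which broadcasts against the trailing channel axis of our NHWC batches:

xb = torch.rand(2,64,64,3)
mean,std = xb.mean((0,1,2)),xb.std((0,1,2))    # both have shape (3,)
((xb-mean)/std).shape                          # torch.Size([2, 64, 64, 3])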
We always like to test everything we build in a notebook, as soon as we build it:
In [ ]:
norm = Normalize(stats)
# normalize, then reorder axes from NHWC to NCHW for PyTorch
def tfm_x(x): return norm(x).permute((0,3,1,2))
In [ ]:
t = tfm_x(x)
t.mean((0,2,3)),t.std((0,2,3))
Out[ ]:
(tensor([0.3732, 0.4907, 0.5633]), tensor([1.0212, 1.0311, 1.0131]))
Here tfm_x isn’t just applying Normalize, but is also permuting the axis order from NHWC to NCHW (see <> if you need a reminder of what these acronyms refer to). PIL uses HWC axis order, which we can’t use with PyTorch, hence the need for this permute.
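As a concrete example of that axis shuffle:

x = torch.zeros(128,64,64,3)       # NHWC, as our Dataset produces
x.permute((0,3,1,2)).shape         # torch.Size([128, 3, 64, 64]), i.e. NCHW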
That’s all we need for the data for our model. So now we need the model itself!