TfmdLists and Datasets: Transformed Collections
Your data is usually a set of raw items (like filenames, or rows in a DataFrame) to which you want to apply a succession of transformations. We just saw that a succession of transformations is represented by a Pipeline in fastai. The class that groups together this Pipeline with your raw items is called TfmdLists.
TfmdLists
Here is the short way of doing the transformation we saw in the previous section:
In [ ]:
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])
At initialization, the TfmdLists will automatically call the setup method of each Transform in order, providing each one not with the raw items but with the items already transformed by all the previous Transforms. We can get the result of our Pipeline on any raw element just by indexing into the TfmdLists:
In [ ]:
t = tls[0]; t[:20]
Out[ ]:
tensor([ 2, 8, 91, 11, 22, 5793, 22, 37, 4910, 34, 11, 8, 13042, 23, 107, 30, 11, 25, 44, 14])
And the TfmdLists knows how to decode for show purposes:
In [ ]:
tls.decode(t)[:100]
Out[ ]:
'xxbos xxmaj well , " cube " ( 1997 ) , xxmaj vincenzo \'s first movie , was one of the most interesti'
In fact, it even has a show method:
In [ ]:
tls.show(t)
xxbos xxmaj well , " cube " ( 1997 ) , xxmaj vincenzo 's first movie , was one of the most interesting and tricky ideas that xxmaj i 've ever seen when talking about movies . xxmaj they had just one scenery , a bunch of actors and a plot . xxmaj so , what made it so special were all the effective direction , great dialogs and a bizarre condition that characters had to deal like rats in a labyrinth . xxmaj his second movie , " cypher " ( 2002 ) , was all about its story , but it was n't so good as " cube " but here are the characters being tested like rats again .
" nothing " is something very interesting and gets xxmaj vincenzo coming back to his ' cube days ' , locking the characters once again in a very different space with no time once more playing with the characters like playing with rats in an experience room . xxmaj but instead of a thriller sci - fi ( even some of the promotional teasers and trailers erroneous seemed like that ) , " nothing " is a loose and light comedy that for sure can be called a modern satire about our society and also about the intolerant world we 're living . xxmaj once again xxmaj xxunk amaze us with a great idea into a so small kind of thing . 2 actors and a blinding white scenario , that 's all you got most part of time and you do n't need more than that . xxmaj while " cube " is a claustrophobic experience and " cypher " confusing , " nothing " is completely the opposite but at the same time also desperate .
xxmaj this movie proves once again that a smart idea means much more than just a millionaire budget . xxmaj of course that the movie fails sometimes , but its prime idea means a lot and offsets any flaws . xxmaj there 's nothing more to be said about this movie because everything is a brilliant surprise and a totally different experience that i had in movies since " cube " .
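If you write your own Transform and add it to the end of such a pipeline, the same ordering applies: its setup will see items already processed by the transforms placed before it. Here is a minimal, hypothetical sketch (the LengthStats transform and its avg_len attribute are invented purely for illustration):

from fastai.text.all import *

class LengthStats(Transform):
    "Hypothetical Transform: records the average document length during setup"
    def setups(self, items):
        # `items` are already tokenized by the Tokenizer placed earlier in the
        # pipeline, not the raw filenames
        self.avg_len = sum(len(toks) for toks in items) / len(items)
    def encodes(self, toks): return toks  # pass items through unchanged

tls = TfmdLists(files, [Tokenizer.from_folder(path), LengthStats()])

After setup, the transform's avg_len attribute holds the average number of tokens per review, computed from the tokenized texts rather than from the raw files.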
The TfmdLists is named with an “s” because it can handle a training and a validation set with a splits argument. You just need to pass the indices of which elements are in the training set, and which are in the validation set:
In [ ]:
cut = int(len(files)*0.8)
splits = [list(range(cut)), list(range(cut,len(files)))]
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize],
                splits=splits)
You can then access them through the train and valid attributes:
In [ ]:
tls.valid[0][:20]
Out[ ]:
tensor([ 2, 8, 20, 30, 87, 510, 1570, 12, 408, 379, 4196, 10, 8, 20, 30, 16, 13, 12216, 202, 509])
If you have manually written a Transform that performs all of your preprocessing at once, turning raw items into a tuple with inputs and targets, then TfmdLists is the class you need. You can directly convert it to a DataLoaders object with the dataloaders method. This is what we will do in our Siamese example later in this chapter.
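For example, here is a minimal sketch of such an all-in-one Transform (the ReviewWithLabel name and its behavior are hypothetical, just to illustrate the pattern):

from fastai.text.all import *

class ReviewWithLabel(Transform):
    "Hypothetical all-in-one Transform: raw filename -> (text, label) tuple"
    def encodes(self, f: Path):
        # build the input and the target from the same raw item in one go
        return (f.read_text(), parent_label(f))

tls = TfmdLists(files, [ReviewWithLabel()], splits=splits)
tls.train[0]  # -> (raw review text, 'pos' or 'neg')
# dls = tls.dataloaders(bs=64)  # possible once items are numericalized/padded so they can be collated

A real version would also tokenize and numericalize the text so that the resulting tuples can be collated into batches, just as the Siamese example later in this chapter does for images.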
In general, though, you will have two (or more) parallel pipelines of transforms: one for processing your raw items into inputs and one to process your raw items into targets. For instance, here, the pipeline we defined only processes the raw text into inputs. If we want to do text classification, we also have to process the labels into targets.
For this we need to do two things. First we take the label name from the parent folder. There is a function, parent_label, for this:
In [ ]:
lbls = files.map(parent_label)
lbls
Out[ ]:
(#50000) ['pos','pos','pos','pos','pos','pos','pos','pos','pos','pos'...]
Then we need a Transform that will grab the unique items and build a vocab with them during setup, then transform the string labels into integers when called. fastai provides this for us; it’s called Categorize:
In [ ]:
cat = Categorize()
cat.setup(lbls)
cat.vocab, cat(lbls[0])
Out[ ]:
((#2) ['neg','pos'], TensorCategory(1))
To do the whole setup automatically on our list of files, we can create a TfmdLists as before:
In [ ]:
tls_y = TfmdLists(files, [parent_label, Categorize()])
tls_y[0]
Out[ ]:
TensorCategory(1)
But then we end up with two separate objects for our inputs and targets, which is not what we want. This is where Datasets comes to the rescue.
Datasets
Datasets will apply two (or more) pipelines in parallel to the same raw object and build a tuple with the result. Like TfmdLists, it will automatically do the setup for us, and when we index into a Datasets, it will return a tuple with the results of each pipeline:
In [ ]:
x_tfms = [Tokenizer.from_folder(path), Numericalize]
y_tfms = [parent_label, Categorize()]
dsets = Datasets(files, [x_tfms, y_tfms])
x,y = dsets[0]
x[:20],y
As with TfmdLists, we can pass splits to a Datasets to split our data between training and validation sets:
In [ ]:
x_tfms = [Tokenizer.from_folder(path), Numericalize]
y_tfms = [parent_label, Categorize()]
dsets = Datasets(files, [x_tfms, y_tfms], splits=splits)
x,y = dsets.valid[0]
x[:20],y
Out[ ]:
(tensor([ 2, 8, 20, 30, 87, 510, 1570, 12, 408, 379, 4196, 10, 8, 20, 30, 16, 13, 12216, 202, 509]),
TensorCategory(0))
It can also decode any processed tuple or show it directly:
In [ ]:
t = dsets.valid[0]
dsets.decode(t)
Out[ ]:
('xxbos xxmaj this movie had horrible lighting and terrible camera movements . xxmaj this movie is a jumpy horror flick with no meaning at all . xxmaj the slashes are totally fake looking . xxmaj it looks like some 17 year - old idiot wrote this movie and a 10 year old kid shot it . xxmaj with the worst acting you can ever find . xxmaj people are tired of knives . xxmaj at least move on to guns or fire . xxmaj it has almost exact lines from " when a xxmaj stranger xxmaj calls " . xxmaj with gruesome killings , only crazy people would enjoy this movie . xxmaj it is obvious the writer does n\'t have kids or even care for them . i mean at show some mercy . xxmaj just to sum it up , this movie is a " b " movie and it sucked . xxmaj just for your own sake , do n\'t even think about wasting your time watching this crappy movie .',
'neg')
The last step is to convert our Datasets object to a DataLoaders, which can be done with the dataloaders method. Here we need to pass along a special argument to take care of the padding problem (as we saw in the last chapter). This needs to happen just before we batch the elements, so we pass it to before_batch:
In [ ]:
dls = dsets.dataloaders(bs=64, before_batch=pad_input)
dataloaders directly calls DataLoader on each subset of our Datasets. fastai’s DataLoader expands the PyTorch class of the same name and is responsible for collating the items from our datasets into batches. It has many points of customization, but the most important ones that you should know are the following (a sketch combining all three appears after the list):

after_item:: Applied on each item after grabbing it inside the dataset. This is the equivalent of item_tfms in DataBlock.
before_batch:: Applied on the list of items before they are collated. This is the ideal place to pad items to the same size.
after_batch:: Applied on the batch as a whole after its construction. This is the equivalent of batch_tfms in DataBlock.
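As a sketch of how these three hooks fit together, the call below passes all of them to dataloaders; the noop placeholders are only there to mark where each hook runs, and only before_batch does real work here, mirroring the padding example above:

dls = dsets.dataloaders(
    bs=64,
    after_item=noop,         # runs on each item right after it is fetched from the dataset
    before_batch=pad_input,  # runs on the list of items before collation; ideal for padding
    after_batch=noop,        # runs on the collated batch, e.g. for augmentation on the GPU
)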
To conclude, here is the full code necessary to prepare the data for text classification:
In [ ]:
tfms = [[Tokenizer.from_folder(path), Numericalize], [parent_label, Categorize]]
files = get_text_files(path, folders = ['train', 'test'])
splits = GrandparentSplitter(valid_name='test')(files)
dsets = Datasets(files, tfms, splits=splits)
dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)
The two differences from the previous code are the use of GrandparentSplitter to split our training and validation data, and the dl_type argument. This is to tell dataloaders to use the SortedDL class of DataLoader, and not the usual one. SortedDL constructs batches by putting samples of roughly the same length into the same batch, which minimizes the amount of padding needed.
This does the exact same thing as our previous DataBlock:
In [ ]:
path = untar_data(URLs.IMDB)
dls = DataBlock(
    blocks=(TextBlock.from_folder(path),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path)
But now, you know how to customize every single piece of it!
Let’s practice what we just learned about this mid-level API for data preprocessing, using a computer vision example now.