The data block API
The data block API lets you customize the creation of a DataBunch
by isolating the underlying parts of that process in separate blocks, mainly:
- Where are the inputs and how to create them?
- How to split the data into training and validation sets?
- How to label the inputs?
- What transforms to apply?
- How to add a test set?
- How to wrap in dataloaders and create the DataBunch?
Each of these may be addressed with a specific block designed for your unique setup. Your inputs might be in a folder, a csv file, or a dataframe. You may want to split them randomly, by certain indices, or depending on the folder they are in. Your labels can be in your csv file or your dataframe, but they may also come from folders or from a specific function of the input. You may choose to add data augmentation or not. A test set is optional too. Finally, you have to set the arguments to put the data together in a DataBunch (batch size, collate function…).
The data block API is so named because you can mix and match each of those blocks with the others, giving you full flexibility to create your customized DataBunch for training, validation and testing. The factory methods of the various DataBunch classes are great for beginners, but you can’t always make your data fit in the tracks they require.
As usual, we’ll begin with end-to-end examples, then switch to the details of each of those parts.
Examples of use
Let’s begin with our traditional MNIST example.
from fastai.vision import *
path = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)
path.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]
(path/'train').ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/3'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/7')]
In vision.data, we can create a DataBunch suitable for image classification by simply typing:
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=64)
This is a shortcut method aimed at data stored in ImageNet-style folders, with train and valid directories, each containing one subdirectory per class that holds all the labelled pictures. There is also a test directory containing unlabelled pictures.
Here is the same code, but this time using the data block API, which can work with any style of a dataset. All the stages, which will be explained below, can be grouped together like this:
data = (ImageList.from_folder(path) #Where to find the data? -> in path and its subfolders
.split_by_folder() #How to split in train/valid? -> use the folders
.label_from_folder() #How to label? -> depending on the folder of the filenames
.add_test_folder() #Optionally add a test set (here default name is test)
.transform(tfms, size=64) #Data augmentation? -> use tfms with a size of 64
.databunch()) #Finally? -> use the defaults for conversion to ImageDataBunch
Now we can look at the created DataBunch:
data.show_batch(3, figsize=(6,6), hide_axis=False)
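To make the folder conventions concrete, here is a library-free sketch of what splitting and labelling by folder amount to. It does not use fastai and is not how the library implements these steps; the file names and the helpers `split_by_top_folder` and `label_from_parent` are made up for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical relative paths laid out in the ImageNet style described above.
files = [PurePosixPath(p) for p in (
    'train/3/0001.png', 'train/7/0002.png',
    'valid/3/0003.png', 'valid/7/0004.png',
)]

def split_by_top_folder(paths, train='train', valid='valid'):
    """Send each path to train or valid according to its first folder."""
    train_items = [p for p in paths if p.parts[0] == train]
    valid_items = [p for p in paths if p.parts[0] == valid]
    return train_items, valid_items

def label_from_parent(path):
    """Use the name of the folder that directly contains the file as its label."""
    return path.parent.name

train_items, valid_items = split_by_top_folder(files)
train_labels = [label_from_parent(p) for p in train_items]
valid_labels = [label_from_parent(p) for p in valid_items]
```

Each file ends up in exactly one split, and its label is simply the class folder it sits in.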
Let’s look at another example from vision.data with the planet dataset. This time, it’s a multi-label classification problem with the labels in a csv file and no given split between valid and train data, so we use a random split. The factory method is:
planet = untar_data(URLs.PLANET_TINY)
planet_tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
pd.read_csv(planet/"labels.csv").head()
(index) | image_name | tags |
---|---|---|
0 | train_31112 | clear primary |
1 | train_4300 | partly_cloudy primary water |
2 | train_39539 | clear primary water |
3 | train_12498 | agriculture clear primary road |
4 | train_9320 | clear primary |
data = ImageDataBunch.from_csv(planet, folder='train', size=128, suffix='.jpg', label_delim = ' ', ds_tfms=planet_tfms)
With the data block API we can rewrite this as follows:
planet.ls()
[PosixPath('/home/ubuntu/.fastai/data/planet_tiny/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/planet_tiny/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/planet_tiny/train'),
PosixPath('/home/ubuntu/.fastai/data/planet_tiny/models')]
data = (ImageList.from_csv(planet, 'labels.csv', folder='train', suffix='.jpg')
#Where to find the data? -> in planet 'train' folder
.split_by_rand_pct()
#How to split in train/valid? -> randomly with the default 20% in valid
.label_from_df(label_delim=' ')
#How to label? -> use the second column of the csv file and split the tags by ' '
.transform(planet_tfms, size=128)
#Data augmentation? -> use tfms with a size of 128
.databunch())
#Finally -> use the defaults for conversion to databunch
data.show_batch(rows=2, figsize=(9,7))
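As a rough illustration of what `label_delim=' '` means for the tags column, the sketch below splits each tag string into a list of labels and encodes it as a 0/1 vector. It is plain Python, not fastai's internal code; the rows are the five shown in the csv preview above, and the helper `multi_hot` is a name invented for this example.

```python
# Rows copied from the labels.csv preview above: (image_name, tags).
rows = [
    ('train_31112', 'clear primary'),
    ('train_4300', 'partly_cloudy primary water'),
    ('train_39539', 'clear primary water'),
    ('train_12498', 'agriculture clear primary road'),
    ('train_9320', 'clear primary'),
]

label_delim = ' '

# Split each tag string into a list of labels.
labels = [tags.split(label_delim) for _, tags in rows]

# Build a sorted vocabulary of every class seen in these rows.
classes = sorted({tag for tags in labels for tag in tags})

def multi_hot(tags, classes):
    """Encode a list of tags as a 0/1 vector over `classes`."""
    return [1 if c in tags else 0 for c in classes]

targets = [multi_hot(tags, classes) for tags in labels]
```

With these five rows the vocabulary has six classes, and an image tagged "clear primary" gets ones only in the positions of those two classes.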
The data block API also allows you to get your data together in problems for which there is no direct ImageDataBunch factory method. For a segmentation task, for instance, we can use it to quickly get a DataBunch. Let’s take the example of the camvid dataset. The images are in an ‘images’ folder and their corresponding masks are in a ‘labels’ folder.
camvid = untar_data(URLs.CAMVID_TINY)
path_lbl = camvid/'labels'
path_img = camvid/'images'
We have a file that gives us the names of the classes (what each code inside the masks corresponds to: a pedestrian, a tree, a road…)
codes = np.loadtxt(camvid/'codes.txt', dtype=str); codes
array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child', 'Column_Pole',
'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving', 'ParkingBlock',
'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone',
'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc', 'Void', 'Wall'], dtype='<U17')
And we define the following function that infers the mask filename from the image filename.
get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'
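To see what this function produces, here is a small standalone check using only `pathlib`. The folder locations and the image file name `frame_0001.png` are hypothetical stand-ins; only the lambda itself comes from the text above.

```python
from pathlib import Path

# Hypothetical locations standing in for camvid/'labels' and camvid/'images'.
path_lbl = Path('camvid_tiny/labels')
path_img = Path('camvid_tiny/images')

# Same function as above: keep the stem, append _P, keep the suffix.
get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'

# A made-up image file name, used only to show the mapping.
image_file = path_img/'frame_0001.png'
mask_file = get_y_fn(image_file)
```

The mask path lives in the labels folder and has the same name as the image with `_P` inserted before the extension.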
Then we can easily define a DataBunch using the data block API. Here we need to use tfm_y=True in the transform call because we need the same transforms to be applied to the target mask as were applied to the image. Side note: for further control over which transformations are used on the target, each transformation has a use_on_y parameter.
data = (SegmentationItemList.from_folder(path_img)
#Where to find the data? -> in path_img and its subfolders
.split_by_rand_pct()
#How to split in train/valid? -> randomly with the default 20% in valid
.label_from_func(get_y_fn, classes=codes)
#How to label? -> use the label function on the file name of the data
.transform(get_transforms(), tfm_y=True, size=128)
#Data augmentation? -> use tfms with a size of 128, also transform the label images
.databunch())
#Finally -> use the defaults for conversion to databunch
data.show_batch(rows=2, figsize=(7,5))
Here is another example, this time for object detection. We use our tiny sample of the COCO dataset. There is a helper function in the library that reads the annotation file and returns the list of image names along with the list of labelled bboxes associated with each. We convert it to a dictionary that maps image names to their bboxes and then write the function that will give us the target for each image filename.
coco = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco/'train.json')
img2bbox = dict(zip(images, lbl_bbox))
get_y_func = lambda o:img2bbox[o.name]
The following code is very similar to what we saw before. The only new addition is the use of a special function to collate the samples in batches. This comes from the fact that our images may have multiple bounding boxes, so we need to pad them to the largest number of bounding boxes.
data = (ObjectItemList.from_folder(coco)
#Where are the images? -> in coco and its subfolders
.split_by_rand_pct()
#How to split in train/valid? -> randomly with the default 20% in valid
.label_from_func(get_y_func)
#How to find the labels? -> use get_y_func on the file name of the data
.transform(get_transforms(), tfm_y=True)
#Data augmentation? -> Standard transforms; also transform the label images
.databunch(bs=16, collate_fn=bb_pad_collate))
#Finally we convert to a DataBunch, use a batch size of 16,
# and we use bb_pad_collate to collate the data into a mini-batch
data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))
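The padding idea behind the collate function can be sketched without any library. The code below is not `bb_pad_collate`; it is a simplified illustration in which the helper name `pad_boxes`, the all-zero padding box and the padding label 0 are assumptions made for this example.

```python
def pad_boxes(batch, pad_box=(0.0, 0.0, 0.0, 0.0), pad_label=0):
    """Pad every sample's boxes and labels to the largest box count in the batch."""
    max_len = max(len(boxes) for boxes, _ in batch)
    padded = []
    for boxes, labels in batch:
        n_missing = max_len - len(boxes)
        padded.append((
            list(boxes) + [pad_box] * n_missing,
            list(labels) + [pad_label] * n_missing,
        ))
    return padded

# Two made-up samples: one with a single box, one with three.
batch = [
    ([(0.1, 0.1, 0.5, 0.5)], [1]),
    ([(0.0, 0.2, 0.3, 0.9), (0.4, 0.4, 0.8, 0.8), (0.1, 0.6, 0.2, 0.7)], [2, 2, 3]),
]
padded = pad_boxes(batch)
```

After padding, every sample carries the same number of boxes, so the batch can be stacked into rectangular tensors.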
But vision isn’t the only application where the data block API works. It can also be used for text and tabular data. With our sample of the IMDB dataset (labelled texts in a csv file), here is how to get the data together for a language model.
from fastai.text import *
imdb = untar_data(URLs.IMDB_SAMPLE)
data_lm = (TextList
.from_csv(imdb, 'texts.csv', cols='text')
#Where is the text? Column 'text' of texts.csv
.split_by_rand_pct()
#How to split it? Randomly with the default 20% in valid
.label_for_lm()
#Label it for a language model
.databunch())
#Finally we convert to a DataBunch
data_lm.show_batch()
idx | text |
---|---|
0 | ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk ! xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is |
1 | , co - billed with xxup the xxup xxunk xxup vampire . a xxmaj spanish - xxmaj italian co - production where a series of women in a village are being murdered around the same time a local count named xxmaj yanos xxmaj xxunk is seen on xxunk , riding off with his ‘ man - eating ‘ dog behind him . n n xxmaj the xxunk already suspect |
2 | sad relic that is well worth seeing . xxbos i caught this on the dish last night . i liked the movie . i xxunk to xxmaj russia 3 different times ( xxunk our 2 kids ) . i ca n’t put my finger on exactly why i liked this movie other than seeing “ bad “ turn “ good “ and “ good “ turn “ semi - bad |
3 | pushed him along . xxmaj the story ( if it can be called that ) is so full of holes it ‘s almost funny , xxmaj it never really explains why the hell he survived in the first place , or needs human flesh in order to survive . xxmaj the script is poorly written and the dialogue xxunk on just plane stupid . xxmaj the climax to movie ( |
4 | the xxunk of the xxmaj xxunk xxmaj race and had the xxunk of some of those racist xxunk . xxmaj fortunately , nothing happened like the incident in the movie where the young xxmaj caucasian man went off and started shooting at a xxunk gathering . n n i can only hope and pray that nothing like that ever will happen . n n xxmaj so is “ |
For a classification problem, we just have to change the way labeling is done. Here we use the csv column label, and the is_valid column to decide the split.
data_clas = (TextList.from_csv(imdb, 'texts.csv', cols='text')
.split_from_df(col='is_valid')
.label_from_df(cols='label')
.databunch())
data_clas.show_batch()
text | target |
---|---|
xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review n n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it ‘s warm and gooey , but you ‘re not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj | negative |
xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there ‘s just no getting around that , and it ‘s hard to actually put one ‘s feeling for this film into words . xxmaj it ‘s not one of those films that tries too hard , nor does it come up with | positive |
xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of “ xxmaj at xxmaj the xxmaj movies “ in taking xxmaj steven xxmaj soderbergh to task . n n xxmaj it ‘s usually satisfying to watch a film director change his style / | negative |
xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . n n xxmaj the format is the same as xxmaj max xxmaj xxunk ‘ “ xxmaj la xxmaj xxunk | positive |
xxbos xxmaj many neglect that this is n’t just a classic due to the fact that it ‘s the first xxup 3d game , or even the first xxunk - up . xxmaj it ‘s also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics | positive |
Lastly, for tabular data, we just have to pass the names of our categorical and continuous variables as extra arguments. We also add some PreProcessors that are going to be applied to our data once the splitting and labelling are done.
from fastai.tabular import *
adult = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(adult/'adult.csv')
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['education-num', 'hours-per-week', 'age', 'capital-loss', 'fnlwgt', 'capital-gain']
procs = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, path=adult, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(valid_idx=range(800,1000))
.label_from_df(cols=dep_var)
.databunch())
data.show_batch()
workclass | education | marital-status | occupation | relationship | race | sex | native-country | education-num_na | education-num | hours-per-week | age | capital-loss | fnlwgt | capital-gain | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
? | Doctorate | Married-civ-spouse | ? | Husband | Amer-Indian-Eskimo | Male | United-States | False | 2.3157 | -0.0356 | 1.7161 | -0.2164 | -1.1496 | -0.1459 | >=50k |
Private | Some-college | Never-married | Sales | Own-child | White | Male | United-States | False | -0.0312 | -0.4406 | -1.4357 | -0.2164 | -0.1893 | -0.1459 | <50k |
Private | Some-college | Never-married | Protective-serv | Own-child | White | Male | United-States | False | -0.0312 | -2.0606 | -1.2891 | -0.2164 | 1.1154 | -0.1459 | <50k |
Private | HS-grad | Married-civ-spouse | Handlers-cleaners | Wife | White | Female | Mexico | False | -0.4224 | -0.0356 | -0.7027 | -0.2164 | 0.0779 | -0.1459 | >=50k |
Private | HS-grad | Married-civ-spouse | Tech-support | Husband | White | Male | United-States | False | -0.4224 | 3.2043 | -0.1163 | -0.2164 | -0.6858 | -0.1459 | >=50k |
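Splitting by index, as done with `valid_idx=range(800,1000)` in the tabular example, is easy to picture in plain Python. The sketch below is only an illustration, not the library's code: `split_by_index` is a hypothetical helper and the list of integers stands in for the rows of a dataframe.

```python
def split_by_index(items, valid_idx):
    """Items whose position is in `valid_idx` go to validation, the rest to training."""
    valid_set = set(valid_idx)
    train = [o for i, o in enumerate(items) if i not in valid_set]
    valid = [o for i, o in enumerate(items) if i in valid_set]
    return train, valid

rows = list(range(2000))  # stand-in for the rows of a dataframe
train, valid = split_by_index(rows, range(800, 1000))
```

Rows 800 to 999 form the validation set and everything else stays in training.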
Step 1: Provide inputs
The basic class to get your inputs is the following one. It’s also the same class that will contain all of your labels (hence the name ItemList).
class ItemList
ItemList(items:Iterator[T_co], path:PathOrStr='.', label_cls:Callable=None, inner_df:Any=None, processor:Union[PreProcessor, Collection[PreProcessor]]=None, x:ItemList=None, ignore_empty:bool=False)
A collection of items with __len__ and __getitem__ with ndarray indexing semantics.
This class groups the inputs for our model in items and saves a path attribute, which is where it will look for any files (image files, csv file with labels…). label_cls will be called to create the labels from the result of the label function, inner_df is an underlying dataframe, and processor is applied to the inputs after the splitting and labeling.
It has multiple subclasses depending on the type of data you’re handling. Here is a quick list:
- CategoryList for labels in classification
- MultiCategoryList for labels in a multi-label classification problem
- FloatList for float labels in a regression problem
- ImageList for data that are images
- SegmentationItemList like ImageList but will default labels to SegmentationLabelList
- SegmentationLabelList for segmentation masks
- ObjectItemList like ImageList but will default labels to ObjectLabelList
- ObjectLabelList for object detection
- PointsItemList for points (of the type ImagePoints)
- ImageImageList for image to image tasks
- TextList for text data
- TextList for text data stored in files
- TabularList for tabular data
- CollabList for collaborative filtering
We can get a little glimpse of how ItemList’s basic attributes and methods behave with the following code examples.
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY)
il_data = ItemList.from_folder(path_data, extensions=['.csv'])
il_data
ItemList (3 items)
/home/ubuntu/.fastai/data/mnist_tiny/labels.csv,/home/ubuntu/.fastai/data/mnist_tiny/history.csv,/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv
Path: /home/ubuntu/.fastai/data/mnist_tiny
Here is how to access the path of ItemList and the actual items (here files) in the path.
il_data.path
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny')
il_data.items
array([PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv')], dtype=object)
len(il_data) gives you the count of files inside il_data, and you can access individual items by index.
len(il_data)
3
ItemList returns a single item with a single index, but returns an ItemList if given a list of indexes.
il_data[1]
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv')
il_data[:1]
ItemList (1 items)
/home/ubuntu/.fastai/data/mnist_tiny/labels.csv
Path: /home/ubuntu/.fastai/data/mnist_tiny
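The indexing rule just described can be mimicked with a few lines of plain Python. `TinyItemList` below is a toy class written for this illustration; it is not fastai's ItemList and leaves out everything except the indexing behaviour.

```python
class TinyItemList:
    """Toy container: an int returns one item, anything else returns a new list."""
    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def get(self, i):
        return self.items[i]

    def __getitem__(self, idxs):
        if isinstance(idxs, int):
            return self.get(idxs)
        if isinstance(idxs, slice):
            return TinyItemList(self.items[idxs])
        return TinyItemList(self.items[i] for i in idxs)

    def __repr__(self):
        return f'TinyItemList ({len(self)} items)'

il = TinyItemList(['labels.csv', 'history.csv', 'cleaned.csv'])
single = il[1]       # one item
head = il[:1]        # a new TinyItemList holding one item
picked = il[[0, 2]]  # a new TinyItemList holding two items
```

A single integer yields the item itself, while a slice or a list of indexes yields a container of the same type.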
With il_data.add we can concatenate another ItemList object in place.
il_data.add(il_data); il_data
ItemList (6 items)
/home/ubuntu/.fastai/data/mnist_tiny/labels.csv,/home/ubuntu/.fastai/data/mnist_tiny/history.csv,/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv,/home/ubuntu/.fastai/data/mnist_tiny/labels.csv,/home/ubuntu/.fastai/data/mnist_tiny/history.csv
Path: /home/ubuntu/.fastai/data/mnist_tiny
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY); path_data.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]
itemlist = ItemList.from_folder(path_data/'test')
itemlist
ItemList (20 items)
/home/ubuntu/.fastai/data/mnist_tiny/test/1503.png,/home/ubuntu/.fastai/data/mnist_tiny/test/5071.png,/home/ubuntu/.fastai/data/mnist_tiny/test/617.png,/home/ubuntu/.fastai/data/mnist_tiny/test/585.png,/home/ubuntu/.fastai/data/mnist_tiny/test/2032.png
Path: /home/ubuntu/.fastai/data/mnist_tiny/test
As we can see, the files are not necessarily returned in alphanumeric order by default. In the above: 1503.png, … 617.png, 585.png …
This is OK when you’re always using the same machine, as the same dataset should return in the same order. But when building a datablock on one machine (say GCP) and then porting the same code to a different machine (say your laptop) that same dataset and code might return the files in a different order.
Since all random operations use the loaded order of the dataset as the starting point, you will not be able to replicate any random operations, say randomly splitting the data into 80% train, and 20% validation, even while correctly seeding.
The solution is to use presort=True in the .from_folder() method. As can be seen below, with that argument turned on, the files are returned in sorted order (sorted as strings, so 2032.png comes before 205.png), and this behavior will match across machines and across platforms. Now you can reproduce any random operation you perform on the loaded data.
itemlist = ItemList.from_folder(path_data/'test', presort=True)
itemlist
ItemList (20 items)
/home/user/.fastai/data/mnist_tiny/test/1503.png,/home/user/.fastai/data/mnist_tiny/test/1605.png,/home/user/.fastai/data/mnist_tiny/test/1883.png,/home/user/.fastai/data/mnist_tiny/test/2032.png,/home/user/.fastai/data/mnist_tiny/test/205.png
Path: /home/user/.fastai/data/mnist_tiny/test
How is the output above generated? Behind the scenes, executing itemlist calls ItemList.__repr__, which basically prints out itemlist[0] to itemlist[4].
itemlist[0]
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test/1503.png')
And itemlist[0] basically calls itemlist.get(0), which returns itemlist.items[0]. That’s why we get outputs like the ones above.
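Returning to the reproducibility point about file order: the effect is easy to demonstrate with the standard library alone. In the sketch below, `rand_split` is a hypothetical helper (not fastai's split), and the last two file names are invented; the idea is simply that a seeded shuffle picks positions, so the same seed on a differently ordered list selects different files unless the list is sorted first.

```python
import random

def rand_split(items, valid_pct=0.2, seed=42):
    """Seeded random split that depends on the order of `items`."""
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)
    cut = int(valid_pct * len(items))
    valid = [items[i] for i in idx[:cut]]
    train = [items[i] for i in idx[cut:]]
    return train, valid

# Same ten files as listed by two machines in different orders.
files_machine_a = ['1503.png', '5071.png', '617.png', '585.png', '2032.png',
                   '1605.png', '1883.png', '205.png', '9001.png', '9002.png']
files_machine_b = files_machine_a[1:] + files_machine_a[:1]  # rotated by one

_, valid_a = rand_split(files_machine_a)
_, valid_b = rand_split(files_machine_b)

# Sorting first removes the dependence on listing order.
_, valid_a_sorted = rand_split(sorted(files_machine_a))
_, valid_b_sorted = rand_split(sorted(files_machine_b))
```

Without sorting, the two machines end up with different validation files despite the identical seed; with sorting, the validation sets match exactly.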
Once you have selected the class that is suitable, you can instantiate it with one of the following factory methods.
from_folder
from_folder(path:PathOrStr, extensions:StrList=None, recurse:bool=True, exclude:OptStrList=None, include:OptStrList=None, processor:Union[PreProcessor, Collection[PreProcessor]]=None, presort:Optional[bool]=False, **kwargs) → ItemList
Create an ItemList in path from the filenames that have a suffix in extensions. recurse determines if we search subfolders.
path = untar_data(URLs.MNIST_TINY)
path.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]
ImageList.from_folder(path)
ImageList (1428 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_tiny
path is your root data folder. In the path directory you have train and valid folders that contain your images. In the example above, the train folder contains two subfolders, one per class: 3 and 7.
from_df
from_df(df:DataFrame, path:PathOrStr='.', cols:IntsOrStrs=0, processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList
Create an ItemList in path from the inputs in the cols of df.
In this example, the dataframe has two columns. The first column is the path to the image and the second column contains the label id for that image. If you have multi-labels (i.e. more than one label for a single image), the labels column will contain a string separated by a space (or by whatever the label_delim argument of label_from_df specifies).
from_df and from_csv can be used in a more general way. If you cannot figure out another way to get your ImageList, it is very easy to make a csv file with the above format.
How to set path? path refers to your root data directory, so the paths in your csv file should be relative to path and not absolute paths. In the example below, the paths to the images in labels.csv are path + train/3/7463.png.
path = untar_data(URLs.MNIST_SAMPLE)
path.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_sample/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/item_list.txt'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/train'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/models'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/cleaned.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/trained_model.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/valid')]
df = pd.read_csv(path/'labels.csv')
df.head()
(index) | name | label |
---|---|---|
0 | train/3/7463.png | 0 |
1 | train/3/21102.png | 0 |
2 | train/3/31559.png | 0 |
3 | train/3/46882.png | 0 |
4 | train/3/26209.png | 0 |
ImageList.from_df(df, path)
ImageList (14434 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample
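The relationship between the root `path` and the relative names in the csv can be pictured with `pathlib`. This is only a sketch of the idea, not the way the library builds its file names internally; the root `/data/mnist_sample` is a hypothetical location, while the two names come from the dataframe preview above.

```python
from pathlib import PurePosixPath

# Hypothetical root folder playing the role of `path`.
root = PurePosixPath('/data/mnist_sample')

# Values taken from the `name` column shown above; they are relative to the root.
names = ['train/3/7463.png', 'train/3/21102.png']

# Each full location is the root joined with the relative name.
full_paths = [root/name for name in names]
```

Because the names are relative, the same csv keeps working when the dataset is moved: only the root changes.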
from_csv
from_csv(path:PathOrStr, csv_name:str, cols:IntsOrStrs=0, delimiter:str=None, header:str='infer', processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList
Create an ItemList in path from the inputs in the cols of path/csv_name.
path = untar_data(URLs.MNIST_SAMPLE)
path.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_sample/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/item_list.txt'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/train'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/models'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/cleaned.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/trained_model.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_sample/valid')]
ImageList.from_csv(path, 'labels.csv')
ImageList (14434 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample
Optional step: filter your data
The factory method may have grabbed too many items. For instance, if you were searching subfolders with the from_folder method, you may have gotten files you don’t want. To remove those, you can use one of the following methods.
filter_by_func
filter_by_func(func:Callable) → ItemList
Only keep elements for which func returns True.
path = untar_data(URLs.MNIST_SAMPLE)
df = pd.read_csv(path/'labels.csv')
df.head()
(index) | name | label |
---|---|---|
0 | train/3/7463.png | 0 |
1 | train/3/21102.png | 0 |
2 | train/3/31559.png | 0 |
3 | train/3/46882.png | 0 |
4 | train/3/26209.png | 0 |
Suppose that you only want to keep images with the suffix “.png”. You can pass filter_by_func a function that checks the suffix of each filename.
Path(df.name[0]).suffix
'.png'
ImageList.from_df(df, path).filter_by_func(lambda fname: Path(fname).suffix == '.png')
ImageList (14434 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample
filter_by_folder
filter_by_folder(include=None, exclude=None)
Only keep filenames in the include folder or reject the ones in exclude.
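Since no example accompanies this method, here is a library-free sketch of the include/exclude idea. It assumes that "folder" means the first path component below the root, which may not match the library's exact rule; `filter_by_top_folder`, the root and the file names are all invented for this illustration.

```python
from pathlib import PurePosixPath

def filter_by_top_folder(paths, root, include=None, exclude=None):
    """Keep paths whose first folder below `root` is in `include` and not in `exclude`."""
    include = list(include or [])
    exclude = list(exclude or [])
    kept = []
    for p in paths:
        folder = p.relative_to(root).parts[0]
        if include and folder not in include:
            continue
        if exclude and folder in exclude:
            continue
        kept.append(p)
    return kept

root = PurePosixPath('/data/mnist_tiny')  # hypothetical root
paths = [root/'train/3/1.png', root/'valid/7/2.png', root/'test/3.png']

only_train = filter_by_top_folder(paths, root, include=['train'])
no_test = filter_by_top_folder(paths, root, exclude=['test'])
```

An empty `include` places no restriction, so passing only `exclude` keeps everything outside the excluded folders.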
filter_by_rand
filter_by_rand(p:float, seed:int=None)
Keep a random sample of items with probability p and an optional seed.
path = untar_data(URLs.MNIST_SAMPLE)
ImageList.from_folder(path).filter_by_rand(0.5)
ImageList (7267 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample
Contrast the number of items with the list created without the filter.
ImageList.from_folder(path)
ImageList (14434 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample
to_text
to_text(fn:str)
Save self.items to fn in self.path.
path = untar_data(URLs.MNIST_SAMPLE)
pd.read_csv(path/'labels.csv').head()
(index) | name | label |
---|---|---|
0 | train/3/7463.png | 0 |
1 | train/3/21102.png | 0 |
2 | train/3/31559.png | 0 |
3 | train/3/46882.png | 0 |
4 | train/3/26209.png | 0 |
file_name = "item_list.txt"
ImageList.from_folder(path).to_text(file_name)
! head {path/file_name}
train/3/5736.png
train/3/35272.png
train/3/26596.png
train/3/42120.png
train/3/39675.png
train/3/47881.png
train/3/38241.png
train/3/59054.png
train/3/9932.png
train/3/50184.png
use_partial_data
use_partial_data(sample_pct:float=0.01, seed:int=None) → ItemList
Use only a sample of sample_pct of the full dataset and an optional seed.
path = untar_data(URLs.MNIST_SAMPLE)
ImageList.from_folder(path).use_partial_data(0.5)
ImageList (7217 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample
Contrast the number of items with the list created without the filter.
ImageList.from_folder(path)
ImageList (14434 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample
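The two sampling methods read differently: one keeps each item "with probability p", the other takes "a sample of sample_pct of the full dataset". The sketch below contrasts those two readings in plain Python. Both helper functions are hypothetical and written for this example, and the library's own implementation may differ, but the behaviour is consistent with the 7267 and 7217 item counts shown above.

```python
import random

def keep_with_probability(items, p, seed=None):
    """Keep each item independently with probability p; the final count varies."""
    rng = random.Random(seed)
    return [o for o in items if rng.random() < p]

def take_fraction(items, sample_pct, seed=None):
    """Take exactly int(sample_pct * len(items)) items chosen at random."""
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)
    cut = int(sample_pct * len(items))
    return [items[i] for i in idx[:cut]]

items = list(range(14434))  # same size as the MNIST_SAMPLE list above
by_probability = keep_with_probability(items, 0.5, seed=0)
by_fraction = take_fraction(items, 0.5, seed=0)
```

The fraction-based sample always has exactly half of the items (7217 here), while the probability-based sample only lands near that number.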
Writing your own ItemList
First, check whether you can simply customize one of the existing subclasses by:
- subclassing an existing one and replacing the get method (or the open method if you’re dealing with images)
- applying a custom processor (see step 4)
- changing the default label_cls for the label creation
- adding a default PreProcessor with the _processor class variable
If this isn’t the case and you really need to write your own class, there is a full tutorial that explains how to proceed.
analyze_pred
analyze_pred(pred:Tensor)
Called on pred before reconstruct for additional preprocessing.
get
get(i) → Any
Subclass if you want to customize how to create item i from self.items.
The following demo gives a glimpse of how get works.
path_data = untar_data(URLs.MNIST_TINY); path_data.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]
il_data_base = ItemList.from_folder(path=path_data, extensions=['.png'], include=['test'])
il_data_base
ItemList (20 items)
/home/ubuntu/.fastai/data/mnist_tiny/test/1503.png,/home/ubuntu/.fastai/data/mnist_tiny/test/5071.png,/home/ubuntu/.fastai/data/mnist_tiny/test/617.png,/home/ubuntu/.fastai/data/mnist_tiny/test/585.png,/home/ubuntu/.fastai/data/mnist_tiny/test/2032.png
Path: /home/ubuntu/.fastai/data/mnist_tiny
get is called implicitly by il_data_base[15]. il_data_base.get(15) gives the same result here, because the default implementation simply returns the stored item.
il_data_base[15]
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test/6736.png')
When creating your custom ItemList, however, you can override this function to do something to your item (like opening an image).
il_data_image = ImageList.from_folder(path=path_data, extensions=['.png'], include=['test'])
il_data_image
ImageList (20 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_tiny
Again, get is normally called implicitly by il_data_image[15].
il_data_image[15]
An image is displayed instead of a file path because ImageList.get overrides ItemList.get and uses ImageList.open to open the image.
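The relationship between indexing and get can be sketched without fastai. The classes below are hypothetical stand-ins (MiniItemList and StemItemList are not fastai classes); they only illustrate that indexing delegates to get, so overriding get changes what indexing returns.

```python
from pathlib import PurePosixPath

class MiniItemList:
    """Hypothetical stand-in for an item list: indexing delegates to `get`."""
    def __init__(self, items):
        self.items = list(items)

    def get(self, i):
        # Default behaviour: return the stored item unchanged
        return self.items[i]

    def __getitem__(self, i):
        return self.get(i)

class StemItemList(MiniItemList):
    """Overrides `get` to return something derived from the stored path."""
    def get(self, i):
        # Here we return the file stem; an image list would open the file instead
        return super().get(i).stem

paths = [PurePosixPath("test/1503.png"), PurePosixPath("test/5071.png")]
base, custom = MiniItemList(paths), StemItemList(paths)
print(base[1], custom[1])
```

The stored items are identical in both lists; only the result of indexing differs.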
new
new(items:Iterator[T_co], processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList
Create a new ItemList from items, keeping the same attributes.
You normally never need to subclass this; just don’t forget to add to self.copy_new, in __init__, the names of the arguments that need to be copied each time new is called.
The following examples give a feel for how new works.
path_data = untar_data(URLs.MNIST_TINY); path_data.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]
itemlist1 = ItemList.from_folder(path=path_data/'valid', extensions=['.png'])
itemlist1
ItemList (699 items)
/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7692.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7484.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9157.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/8703.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9182.png
Path: /home/ubuntu/.fastai/data/mnist_tiny/valid
As shown below, copy_new lists the arguments whose values are carried over from itemlist1, and itemlist1.new(itemlist1.items) passes items together with the arguments named in copy_new to ItemList.__init__ to create another ItemList.
itemlist1.copy_new == ['x', 'label_cls', 'path']
True
((itemlist1.x == itemlist1.label_cls == itemlist1.inner_df == None)
and (itemlist1.path == Path('/Users/Natsume/.fastai/data/mnist_tiny/valid')))
False
You can add any argument from the signature of ItemList.__init__ to copy_new and change its value.
itemlist1.copy_new = ['x', 'label_cls', 'path', 'inner_df']
itemlist1.x = itemlist1.label_cls = itemlist1.path = itemlist1.inner_df = 'test'
itemlist2 = itemlist1.new(items=itemlist1.items)
(itemlist2.inner_df == itemlist2.x == itemlist2.label_cls == 'test'
and itemlist2.path == Path('test'))
True
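The copy_new mechanism can be imitated in a few lines of plain Python. This is only a sketch under simplified assumptions: MiniItemList and TaggedItemList are hypothetical classes, and the tag argument is invented to show a subclass registering an extra attribute.

```python
class MiniItemList:
    """Hypothetical stand-in showing how a `copy_new` list can drive `new`."""
    def __init__(self, items, path=".", label_cls=None, x=None):
        self.items, self.path, self.label_cls, self.x = list(items), path, label_cls, x
        # Names of the constructor arguments to carry over when `new` is called
        self.copy_new = ["x", "label_cls", "path"]

    def new(self, items, **kwargs):
        # Collect current values of the registered attributes; explicit kwargs win
        copied = {name: getattr(self, name) for name in self.copy_new}
        return self.__class__(items, **{**copied, **kwargs})

class TaggedItemList(MiniItemList):
    """Subclass with an extra (made-up) argument that must survive `new`."""
    def __init__(self, items, tag="none", **kwargs):
        super().__init__(items, **kwargs)
        self.tag = tag
        self.copy_new.append("tag")   # register the extra argument in __init__

a = TaggedItemList([1, 2, 3], tag="demo", path="data")
b = a.new([4, 5])                 # keeps tag and path
c = a.new([6], path="other")      # overrides path, keeps tag
print(b.tag, b.path, c.path)
```

Without the copy_new.append call, b.tag would silently fall back to the default value.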
reconstruct
reconstruct(t:Tensor, x:Tensor=None)
Reconstruct one of the underlying items from its data t.
Step 2: Split the data between the training and the validation set
This step is normally straightforward: you just have to pick one of the following functions depending on what you need.
split_none
split_none()
Don’t split the data and create an empty validation set.
split_by_rand_pct
split_by_rand_pct(valid_pct:float=0.2, seed:int=None) → ItemLists
Tests found for split_by_rand_pct:
pytest -sv tests/test_data_block.py::test_splitdata_datasets
Some other tests where split_by_rand_pct is used:
pytest -sv tests/test_data_block.py::test_regression
Split the items randomly by putting valid_pct of them in the validation set; an optional seed can be passed.
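As a rough illustration of a random percentage split, the sketch below works on a plain Python list. rand_pct_split is a hypothetical helper, not the library function, and it returns two lists rather than an ItemLists.

```python
import random

def rand_pct_split(items, valid_pct=0.2, seed=None):
    """Randomly move `valid_pct` of `items` into a validation list (illustrative only)."""
    rng = random.Random(seed)
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(items))     # size of the validation set
    valid_idx = set(idxs[:cut])
    # Preserve the original order inside each subset
    train = [o for i, o in enumerate(items) if i not in valid_idx]
    valid = [o for i, o in enumerate(items) if i in valid_idx]
    return train, valid

items = list(range(100))
train, valid = rand_pct_split(items, valid_pct=0.2, seed=0)
print(len(train), len(valid))
```

Every item ends up in exactly one of the two lists, and fixing the seed makes the split repeatable across runs.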
split_subsets
split_subsets(train_size:float, valid_size:float, seed=None) → ItemLists
Tests found for split_subsets:
pytest -sv tests/test_data_block.py::test_split_subsets
Split the items into a training set of size train_size * n and a validation set of size valid_size * n, where n is the number of items.
This function is handy if you want to work with subsets of specific sizes, e.g., you want to use 20% of the data for the validation set but train on only a small subset of the rest: split_subsets(train_size=0.08, valid_size=0.2).
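The arithmetic behind the sizes can be checked with a small sketch. subset_split is a hypothetical plain-Python helper written for this illustration; the sanity checks on the sizes are an assumption about what a reasonable implementation would enforce.

```python
import random

def subset_split(items, train_size, valid_size, seed=None):
    """Draw disjoint train/valid subsets of the given relative sizes (illustrative only)."""
    # Assumed sanity checks: both fractions in (0, 1) and not overlapping
    assert 0 < train_size < 1 and 0 < valid_size < 1
    assert train_size + valid_size <= 1
    rng = random.Random(seed)
    n = len(items)
    idxs = list(range(n))
    rng.shuffle(idxs)
    train_cut, valid_cut = int(train_size * n), int(valid_size * n)
    train = [items[i] for i in idxs[:train_cut]]        # first chunk of the permutation
    valid = [items[i] for i in idxs[n - valid_cut:]]    # last chunk of the permutation
    return train, valid

train, valid = subset_split(list(range(1000)), train_size=0.08, valid_size=0.2, seed=1)
print(len(train), len(valid))
```

With 1000 items, the call keeps 80 items for training and 200 for validation, and the remaining 720 items are simply left out.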
split_by_files
split_by_files(valid_names:ItemList) → ItemLists
Split the data by using the names in valid_names for validation.
split_by_fname_file
split_by_fname_file(fname:PathOrStr, path:PathOrStr=None) → ItemLists
Split the data by using the names in fname for the validation set. path will override self.path.
Internally, this calls split_by_files. The file fname lists your image file names, such as 0001.png.
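Here is a minimal sketch of splitting by a file of names, assuming one file name per line. split_by_name_file and the file valid.txt are hypothetical; the real method works on an ItemList and returns an ItemLists.

```python
import tempfile
from pathlib import Path

def split_by_name_file(items, fname):
    """Send every item whose file name is listed in `fname` to validation (illustrative only)."""
    valid_names = set(Path(fname).read_text().split())   # one name per line
    train = [o for o in items if Path(o).name not in valid_names]
    valid = [o for o in items if Path(o).name in valid_names]
    return train, valid

items = [Path("train/3/0001.png"), Path("train/3/0002.png"), Path("train/7/0003.png")]
with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical names file created just for this demo
    fname = Path(tmp) / "valid.txt"
    fname.write_text("0002.png\n0003.png\n")
    train, valid = split_by_name_file(items, fname)
print([p.name for p in train], [p.name for p in valid])
```

Only the file name is compared, not the folder, so identical names in different folders would all land in the validation set in this sketch.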
split_by_folder
split_by_folder(train:str='train', valid:str='valid') → ItemLists
Some tests where split_by_folder is used:
pytest -sv tests/test_data_block.py::test_wrong_order
Split the data depending on the folder (train or valid) in which the filenames are.
Note: This method looks at the folder immediately after self.path for valid and train.
Basically, split_by_folder takes two folder names (‘train’ and ‘valid’ in the following example) and splits il, the full ItemList, into two smaller ItemList objects, one for the training set and one for the validation set. Both are attached to an ItemLists object, which is the final output of split_by_folder.
path_data = untar_data(URLs.MNIST_TINY); path_data.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]
il = ItemList.from_folder(path=path_data); il
ItemList (1439 items)
/home/ubuntu/.fastai/data/mnist_tiny/labels.csv,/home/ubuntu/.fastai/data/mnist_tiny/export.pkl,/home/ubuntu/.fastai/data/mnist_tiny/history.csv,/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv,/home/ubuntu/.fastai/data/mnist_tiny/test/1503.png
Path: /home/ubuntu/.fastai/data/mnist_tiny
sd = il.split_by_folder(train='train', valid='valid'); sd
ItemLists;
Train: ItemList (713 items)
/home/ubuntu/.fastai/data/mnist_tiny/train/export.pkl,/home/ubuntu/.fastai/data/mnist_tiny/train/3/9932.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/7189.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8498.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8888.png
Path: /home/ubuntu/.fastai/data/mnist_tiny;
Valid: ItemList (699 items)
/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7692.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7484.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9157.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/8703.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9182.png
Path: /home/ubuntu/.fastai/data/mnist_tiny;
Test: None
Behind the scenes, split_by_folder uses _get_by_folder(name) to turn the ‘train’ and ‘valid’ folders into two lists of indexes, and passes them to split_by_idxs, which splits il into two ItemList objects and attaches them to an ItemLists.
train_idx = il._get_by_folder(name='train')
train_idx[:5], train_idx[-5:], len(train_idx)
([24, 25, 26, 27, 28], [732, 733, 734, 735, 736], 713)
valid_idx = il._get_by_folder(name='valid')
valid_idx[:5], valid_idx[-5:],len(valid_idx)
([740, 741, 742, 743, 744], [1434, 1435, 1436, 1437, 1438], 699)
By the way, _get_by_folder(name) works as follows: it loops over il.items by index and, whenever an item belongs to the named folder (e.g., ‘train’), adds the index of that item to a list. The folder name is the only input, and the list of indexes is the output.
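The index lookup just described can be sketched as follows. get_idxs_by_folder is a hypothetical stand-in (it takes the base path explicitly instead of reading it from the item list), and the paths are invented for the demo.

```python
from pathlib import PurePosixPath

def get_idxs_by_folder(items, base, name):
    """Indexes of the items whose first folder below `base` is `name` (illustrative only)."""
    base_parts = len(PurePosixPath(base).parts)   # how many path parts belong to the base
    return [i for i, o in enumerate(items)
            if PurePosixPath(o).parts[base_parts] == name]

base = "data/mnist_tiny"
items = [
    f"{base}/labels.csv",
    f"{base}/train/3/1.png",
    f"{base}/train/7/2.png",
    f"{base}/valid/3/3.png",
    f"{base}/test/4.png",
]
train_idx = get_idxs_by_folder(items, base, "train")
valid_idx = get_idxs_by_folder(items, base, "valid")
print(train_idx, valid_idx)
```

Items that sit directly under the base path or in other folders (here labels.csv and the test image) appear in neither list, which is why the split sizes in the example above do not add up to the full item count.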
split_by_idx
split_by_idx(valid_idx:Collection[int]) → ItemLists
Some tests where split_by_idx is used:
pytest -sv tests/test_data_block.py::test_category
pytest -sv tests/test_data_block.py::test_category_processor_existing_class
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class
pytest -sv tests/test_data_block.py::test_multi_category
Split the data according to the indexes in valid_idx.
path = untar_data(URLs.MNIST_SAMPLE)
df = pd.read_csv(path/'labels.csv')
df.head()
| | name | label |
|---|---|---|
| 0 | train/3/7463.png | 0 |
| 1 | train/3/21102.png | 0 |
| 2 | train/3/31559.png | 0 |
| 3 | train/3/46882.png | 0 |
| 4 | train/3/26209.png | 0 |
You can pass a list of the indices that you want to put in the validation set, like [1, 3, 10], or a contiguous list like list(range(1000)).
data = (ImageList.from_df(df, path)
.split_by_idx(list(range(1000))))
data
ItemLists;
Train: ImageList (13434 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample;
Valid: ImageList (1000 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample;
Test: None
split_by_idxs
split_by_idxs(train_idx, valid_idx)
Split the data between train_idx and valid_idx.
Behind the scenes, split_by_idxs turns the two index lists (train_idx and valid_idx) into two ItemList objects, and then passes them to split_by_list, which attaches them to an ItemLists.
sd = il.split_by_idxs(train_idx=train_idx, valid_idx=valid_idx); sd
ItemLists;
Train: ItemList (713 items)
/home/ubuntu/.fastai/data/mnist_tiny/train/export.pkl,/home/ubuntu/.fastai/data/mnist_tiny/train/3/9932.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/7189.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8498.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8888.png
Path: /home/ubuntu/.fastai/data/mnist_tiny;
Valid: ItemList (699 items)
/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7692.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7484.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9157.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/8703.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9182.png
Path: /home/ubuntu/.fastai/data/mnist_tiny;
Test: None
split_by_list
split_by_list(train, valid)
Split the data between train and valid.
split_by_list takes two ItemList objects, which in the case below are il[train_idx] and il[valid_idx], and passes them to _split (ItemLists) to initialize an ItemLists object, which holds the training, validation and (optionally) test ItemList objects as its properties.
sd = il.split_by_list(train=il[train_idx], valid=il[valid_idx]); sd
ItemLists;
Train: ItemList (713 items)
/home/ubuntu/.fastai/data/mnist_tiny/train/export.pkl,/home/ubuntu/.fastai/data/mnist_tiny/train/3/9932.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/7189.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8498.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8888.png
Path: /home/ubuntu/.fastai/data/mnist_tiny;
Valid: ItemList (699 items)
/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7692.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7484.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9157.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/8703.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9182.png
Path: /home/ubuntu/.fastai/data/mnist_tiny;
Test: None
This is more of an internal method; you should use split_by_files if you want to pass a list of filenames for the validation set.
split_by_valid_func
split_by_valid_func(func:Callable) → ItemLists
Split the data by the result of func (which returns True for items that belong in the validation set).
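A predicate-based split is easy to picture on a plain list. In this sketch, split_by_predicate is a hypothetical helper and the patient-style file names are invented; the point is that one function decides, item by item, what goes to validation.

```python
def split_by_predicate(items, func):
    """Items for which `func` returns True go to validation (illustrative only)."""
    valid_idx = {i for i, o in enumerate(items) if func(o)}
    train = [o for i, o in enumerate(items) if i not in valid_idx]
    valid = [o for i, o in enumerate(items) if i in valid_idx]
    return train, valid

# Made-up file names: hold out every image of one patient
files = ["patient1_a.png", "patient2_a.png", "patient3_a.png", "patient3_b.png"]
train, valid = split_by_predicate(files, lambda name: name.startswith("patient3_"))
print(train, valid)
```

This style is useful when the validation set must respect a grouping (patients, dates, sources) that a purely random split would break.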
split_from_df
split_from_df(col:IntsOrStrs=2)
Split the data from the col in the dataframe in self.inner_df.
To use this function, you need a boolean column (by default, the third column of the dataframe). The examples put in the validation set are those whose value in that column is True.
path = untar_data(URLs.MNIST_SAMPLE)
df = pd.read_csv(path/'labels.csv')
# Create a new column for is_valid
df['is_valid'] = [True]*(df.shape[0]//2) + [False]*(df.shape[0]//2)
# Randomly shuffle dataframe
df = df.reindex(np.random.permutation(df.index))
print(df.shape)
df.head()
(14434, 3)
| | name | label | is_valid |
|---|---|---|---|
| 2071 | train/3/28571.png | 0 | True |
| 9382 | train/7/24434.png | 1 | False |
| 6399 | train/7/56604.png | 1 | True |
| 130 | train/3/4740.png | 0 | True |
| 9226 | train/7/18876.png | 1 | False |
data = (ImageList.from_df(df, path)
.split_from_df())
data
ItemLists;
Train: ImageList (7217 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample;
Valid: ImageList (7217 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /home/ubuntu/.fastai/data/mnist_sample;
Test: None
Warning: This method assumes the data has been created from a csv file or a dataframe.
Step 3: Label the inputs
To label your inputs, use one of the following functions. Note that even if it’s not in the documented arguments, you can always pass a label_cls that will be used to create those labels (the default is the one from your input ItemList, and if there is none, it will go to CategoryList, MultiCategoryList or FloatList depending on the type of the labels). This is implemented in the following function:
get_label_cls
get_label_cls(labels, label_cls:Callable=None, label_delim:str=None, **kwargs)
Return label_cls or guess one from the first element of labels.
Behind the scenes, ItemList.get_label_cls selects a label class according to the type of the items in labels, where labels can be a Collection, a pandas.core.frame.DataFrame or a pandas.core.series.Series. If the elements are strings or integers, get_label_cls returns CategoryList; if they are floats, it returns FloatList; if they are Collections, it returns MultiCategoryList.
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY)
sd = ImageList.from_folder(path_data).split_by_folder('train', 'valid'); sd
ItemLists;
Train: ImageList (709 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Valid: ImageList (699 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Test: None
labels = ['7', '3']
label_cls = sd.train.get_label_cls(labels); label_cls
fastai.data_block.CategoryList
labels = [7, 3]
label_cls = sd.train.get_label_cls(labels); label_cls
fastai.data_block.CategoryList
labels = [7.0, 3.0]
label_cls = sd.train.get_label_cls(labels); label_cls
fastai.data_block.FloatList
labels = [[7, 3],]
label_cls = sd.train.get_label_cls(labels); label_cls
fastai.data_block.MultiCategoryList
labels = [['7', '3'],]
label_cls = sd.train.get_label_cls(labels); label_cls
fastai.data_block.MultiCategoryList
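The dispatch shown in these examples can be summarized in a short sketch. guess_label_kind is a hypothetical function that returns class names as strings instead of the classes themselves; treating a passed label_delim as a sign of multi-label data is an assumption made for this illustration.

```python
from collections.abc import Collection
from numbers import Integral

def guess_label_kind(labels, label_cls=None, label_delim=None):
    """Name of the label list class suggested by the first label (illustrative only)."""
    if label_cls is not None:
        return label_cls                  # an explicit choice always wins
    if label_delim is not None:
        return "MultiCategoryList"        # assumed: a delimiter implies several tags per item
    first = labels[0]
    if isinstance(first, float):
        return "FloatList"                # regression target
    if isinstance(first, (str, Integral)):
        return "CategoryList"             # single category (checked before Collection, since str is one)
    if isinstance(first, Collection):
        return "MultiCategoryList"        # list of tags
    return "ItemList"

for labels in (["7", "3"], [7, 3], [7.0, 3.0], [[7, 3]]):
    print(labels, "->", guess_label_kind(labels))
```

Only the first element is inspected, so a column that mixes types would be classified by whatever happens to come first.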
If no label_cls argument is passed, the correct labeling type can usually be inferred based on the data (for classification or regression). If you have multiple regression targets (e.g. predict 5 different numbers from a single image/text), be aware that arrays of floats are by default considered to be targets for one-hot encoded classification. If your task is regression, be sure to pass label_cls = FloatList so that learners created from your databunch initialize correctly.
The first example in these docs created labels as follows:
path = untar_data(URLs.MNIST_TINY)
ll = ImageList.from_folder(path).split_by_folder().label_from_folder().train
If you want to save the data necessary to recreate your LabelList (not including saving the actual image/text/etc files), you can use to_df or to_csv:
ll.to_csv('tmp.csv')
Or just grab a pd.DataFrame directly:
ll.to_df().head()
| | x | y |
|---|---|---|
| 0 | train/7/9243.png | 7 |
| 1 | train/7/9519.png | 7 |
| 2 | train/7/7534.png | 7 |
| 3 | train/7/9082.png | 7 |
| 4 | train/7/8377.png | 7 |
label_empty
label_empty(**kwargs)
Label every item with an EmptyLabel.
label_from_df
label_from_df(cols:IntsOrStrs=1, label_cls:Callable=None, **kwargs)
Some tests where label_from_df is used:
pytest -sv tests/test_data_block.py::test_category
pytest -sv tests/test_data_block.py::test_category_processor_existing_class
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class
pytest -sv tests/test_data_block.py::test_multi_category
pytest -sv tests/test_data_block.py::test_regression
Label self.items from the values in cols in self.inner_df.
Warning: This method only works with data objects created with either the from_csv or from_df methods.
label_const
label_const(const:Any=0, label_cls:Callable=None, **kwargs) → LabelList
Some tests where label_const is used:
pytest -sv tests/test_data_block.py::test_split_subsets
pytest -sv tests/test_data_block.py::test_splitdata_datasets
Label every item with const.
label_from_folder
label_from_folder(label_cls:Callable=None, **kwargs) → LabelList
Tests found for label_from_folder:
pytest -sv tests/test_text_data.py::test_filter_classes
pytest -sv tests/test_text_data.py::test_from_folder
Some other tests where label_from_folder is used:
pytest -sv tests/test_data_block.py::test_wrong_order
Give a label to each filename depending on its folder.
Note: This method looks at the last subfolder in the path to determine the classes.
Behind the scenes, when an ItemList calls label_from_folder, it creates a lambda function that returns the name of the folder a file Path object directly belongs to, and then calls label_from_func with that lambda function as input.
In practice, label_from_folder is mostly used with ItemLists rather than ItemList, for simplicity and efficiency; for details see the label_from_folder example on ItemLists. Even when you just want a training set ItemList, you still need to call split_none to create an ItemLists and then label it with label_from_folder, as in the example below.
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY); path_data.ls()
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/models'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train')]
sd_train = ImageList.from_folder(path_data/'train').split_none()
ll_train = sd_train.label_from_folder(); ll_train
LabelLists;
Train: LabelList (709 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Valid: LabelList (0 items)
x: ImageList
y: CategoryList
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Test: None
label_from_func
label_from_func(func:Callable, label_cls:Callable=None, **kwargs) → LabelList
Apply func to every input to get its label.
Internally, label_from_func applies func to every item of an ItemList, collects the outputs in a list, and then passes that list to ItemList._label_from_list. Below is a simple example of using label_from_func.
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY)
sd = ImageList.from_folder(path_data).split_by_folder('train', 'valid');sd
ItemLists;
Train: ImageList (709 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Valid: ImageList (699 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Test: None
func=lambda o: (o.parts if isinstance(o, Path) else o.split(os.path.sep))[-2]
The lambda function above returns the name of the folder that directly contains a file Path object.
ll = sd.label_from_func(func); ll
LabelLists;
Train: LabelList (709 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Valid: LabelList (699 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Test: None
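Stripped of the fastai classes, labeling with a function boils down to mapping that function over the items. The sketch below uses two hypothetical helpers, parent_folder and label_with, on made-up paths; parent_folder mirrors the idea of the lambda above but is written for POSIX-style paths only.

```python
from pathlib import PurePosixPath

def parent_folder(o):
    """Name of the folder that directly contains `o` (path object or POSIX-style string)."""
    parts = o.parts if isinstance(o, PurePosixPath) else str(o).split("/")
    return parts[-2]

def label_with(items, func):
    """Apply `func` to every item and collect the labels (illustrative only)."""
    return [func(o) for o in items]

items = [PurePosixPath("mnist_tiny/train/7/9243.png"), "mnist_tiny/valid/3/7692.png"]
labels = label_with(items, parent_folder)
print(labels)
```

Any callable works in place of parent_folder, for example one that looks a file name up in a dictionary of annotations.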
label_from_re
label_from_re(pat:str, full_path:bool=False, label_cls:Callable=None, **kwargs) → LabelList
Apply the re in pat to determine the label of every filename. If full_path, search in the full name.
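A regular-expression labeler can be sketched with the standard re module. labels_from_re is a hypothetical helper, the file names are invented, and two simplifications are assumptions of this sketch: the label is the first capture group, and full_path switches between matching the file name only and matching the whole path string.

```python
import re
from pathlib import PurePosixPath

def labels_from_re(items, pat, full_path=False):
    """Label each item with the first capture group of `pat` (illustrative only)."""
    pat = re.compile(pat)
    labels = []
    for o in items:
        # Assumed meaning of full_path: search the whole path instead of the file name
        s = str(o) if full_path else PurePosixPath(o).name
        m = pat.search(s)
        if m is None:
            raise ValueError(f"Failed to find {pat.pattern!r} in {s!r}")
        labels.append(m.group(1))
    return labels

items = ["images/great_pyrenees_173.jpg", "images/Abyssinian_1.jpg"]
by_name = labels_from_re(items, r"^(.+)_\d+\.jpg$")
by_path = labels_from_re(items, r"/([^/]+)_\d+\.jpg$", full_path=True)
print(by_name, by_path)
```

Failing loudly on a non-matching name is deliberate here: a silent None label would be much harder to track down later.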
class CategoryList
CategoryList(items:Iterator[T_co], classes:Collection[T_co]=None, label_delim:str=None, **kwargs) :: CategoryListBase
Basic ItemList for single classification labels.
ItemList suitable for storing labels in items belonging to classes. If None is passed, classes will be determined by the unique different labels. processor will default to CategoryProcessor.
CategoryList uses labels to create an ItemList for dealing with categorical labels. Behind the scenes, CategoryList is a subclass of CategoryListBase, which is a subclass of ItemList. CategoryList inherits from CategoryListBase properties such as classes (default None) and filter_missing_y (default True), and has its own property loss_func (default CrossEntropyFlat()) and its own class attribute _processor (default CategoryProcessor).
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY)
ll = ImageList.from_folder(path_data).split_by_folder('train', 'valid').label_from_folder()
ll.train.y.items, ll.train.y.classes, ll.train.y[0]
(array([1, 1, 1, 1, ..., 0, 0, 0, 0]), ['3', '7'], Category 7)
cl = CategoryList(ll.train.y.items, ll.train.y.classes); cl
CategoryList (709 items)
7,7,7,7,7
Path: .
For how a CategoryList object is printed or how an element is accessed by index, see CategoryList.get below.
Behind the scenes, CategoryList.get is called implicitly when printing the CategoryList object or evaluating cl[idx]. According to the source of CategoryList.get, each item is used to look up its own class. When classes is a list of strings, the elements of items are used as indexes into that list, so they must be integers in the range from 0 to len(classes)-1; if classes is a dictionary, the elements of items are used as keys, so they can be strings too. See the examples below for details.
from fastai.vision import *
items = np.array([0, 1, 2, 1, 0])
cl = CategoryList(items, classes=['3', '7', '9']); cl
CategoryList (5 items)
3,7,9,7,3
Path: .
items = np.array(['3', '7', '9', '7', '3'])
classes = {'3':3, '7':7, '9':9}
cl = CategoryList(items, classes); cl
CategoryList (5 items)
3,7,9,7,3
Path: .
class MultiCategoryList
MultiCategoryList(items:Iterator[T_co], classes:Collection[T_co]=None, label_delim:str=None, one_hot:bool=False, **kwargs) :: CategoryListBase
Basic ItemList for multi-classification labels.
It will store a list of labels in items belonging to classes. If None is passed, classes will be determined by the different unique labels. label_delim is used to split the content of items into a list of tags.
If one_hot=True, the items contain the labels one-hot encoded. In this case, it is mandatory to pass a list of classes (as they can’t be inferred from the distinct labels).
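To illustrate what splitting on a delimiter and one-hot encoding mean for multi-label data, here is a small plain-Python sketch. encode_multi is a hypothetical helper and the tag strings are invented; it is not how the class stores its data internally.

```python
def encode_multi(items, classes=None, label_delim=None):
    """Split tag strings and one-hot encode them against `classes` (illustrative only)."""
    # With a delimiter, each item is a string of tags; without one, each item
    # is assumed to already be a list of tags.
    tags = [o.split(label_delim) if label_delim is not None else list(o) for o in items]
    if classes is None:
        # Infer the classes from the distinct tags, in sorted order
        classes = sorted({t for ts in tags for t in ts})
    one_hot = [[1 if c in ts else 0 for c in classes] for ts in tags]
    return classes, one_hot

items = ["cloudy primary", "clear", "primary water"]
classes, one_hot = encode_multi(items, label_delim=" ")
print(classes)
print(one_hot)
```

Going the other way, from one-hot rows back to tags, is only possible when the list of classes is known, which is why classes must be passed when the items are already one-hot encoded.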
class FloatList
FloatList(items:Iterator[T_co], log:bool=False, classes:Collection[T_co]=None, **kwargs) :: ItemList
ItemList suitable for storing the floats in items for regression. Will take the log of the items if the log flag is True.
class EmptyLabelList
EmptyLabelList(items:Iterator[T_co], path:PathOrStr='.', label_cls:Callable=None, inner_df:Any=None, processor:Union[PreProcessor, Collection[PreProcessor]]=None, x:ItemList=None, ignore_empty:bool=False) :: ItemList
Basic ItemList for dummy labels.
Invisible step: preprocessing
This isn’t seen here in the API, but if you passed a processor (or a list of them) in your initial ItemList during step 1, it will be applied here. If you didn’t pass any processor, a list of them might still be created depending on what is in the _processor variable of your class of items (this can be a list of PreProcessor classes).
A processor is a transformation that is applied to all the inputs once at initialization, with a state computed on the training set that is then applied without modification on the validation set (and maybe the test set). For instance, it can be processing texts to tokenize then numericalize them. In that case we want the validation set to be numericalized with exactly the same vocabulary as the training set.
Another example is in tabular data, where we fill missing values with (for instance) the median computed on the training set. That statistic is stored in the inner state of the PreProcessor and applied on the validation set.
This is the generic class for all processors.
class PreProcessor
PreProcessor(ds:Collection[T_co]=None)
Basic class for a processor that will be applied to items at the end of the data block API.
process_one
process_one(item:Any)
Some tests where process_one is used:
pytest -sv tests/test_data_block.py::test_category_processor_existing_class
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class
Process one item. This method needs to be written in any subclass.
process
process(ds:Collection[T_co])
Process a dataset ds, which is an ItemList object. By default, this applies process_one to every item of ds.
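The median-filling example mentioned in the preprocessing step maps naturally onto process_one and process. The sketch below is a hypothetical, fastai-free processor that works on plain lists (FillMedianProcessor is not a library class); it shows state being computed on the first dataset it sees, assumed to be the training set, and reused unchanged afterwards.

```python
from statistics import median

class FillMedianProcessor:
    """Fill missing values with the median of the first dataset processed (illustrative only)."""
    def __init__(self):
        self.fill_value = None            # inner state, computed once

    def process_one(self, item):
        # Replace a missing value by the stored statistic
        return self.fill_value if item is None else item

    def process(self, values):
        if self.fill_value is None:
            # First call (assumed to be the training set): compute the state
            self.fill_value = median(v for v in values if v is not None)
        return [self.process_one(v) for v in values]

proc = FillMedianProcessor()
train = proc.process([1.0, None, 3.0, 5.0])   # state is computed here
valid = proc.process([None, 100.0])           # state is reused, not recomputed
print(train, valid)
```

The missing validation value is filled with the training median rather than with a statistic of the validation data, which is exactly the property that keeps the two sets consistent.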
class CategoryProcessor
CategoryProcessor(ds:ItemList) :: PreProcessor
PreProcessor that creates classes from ds.items and handles the mapping.
generate_classes
generate_classes(items)
Generate classes from items by taking the sorted unique values.
process
process(ds)
ds is a CategoryList object.
It generates a list of unique labels (assigned to ds.classes) and a dictionary mapping classes to indexes (assigned to ds.c2i).
It is an internal function only called to apply processors to training, validation and testing datasets after the labeling step.
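The class list and the class-to-index dictionary can be pictured with a tiny stand-in. MiniCategoryProcessor is hypothetical and operates on plain lists instead of a CategoryList; mapping an unseen label to None is an assumption of this sketch.

```python
class MiniCategoryProcessor:
    """Build classes and a class-to-index mapping, then encode labels (illustrative only)."""
    def __init__(self):
        self.classes, self.c2i = None, None

    def generate_classes(self, items):
        # Sorted unique values, as described for generate_classes
        return sorted(set(items))

    def process_one(self, item):
        # Assumed behaviour: labels never seen in training map to None
        return self.c2i.get(item)

    def process(self, items):
        if self.classes is None:
            # First call (the training labels): fix the classes and the mapping
            self.classes = self.generate_classes(items)
            self.c2i = {c: i for i, c in enumerate(self.classes)}
        return [self.process_one(o) for o in items]

proc = MiniCategoryProcessor()
train_codes = proc.process(["7", "3", "7"])
valid_codes = proc.process(["3", "9"])     # "9" was never seen in training
print(proc.classes, train_codes, valid_codes)
```

Because the mapping is frozen after the first call, the validation labels are encoded with the same indexes as the training labels.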
class MultiCategoryProcessor
MultiCategoryProcessor(ds:ItemList, one_hot:bool=False) :: CategoryProcessor
PreProcessor that creates classes from ds.items and handles the mapping.
generate_classes
generate_classes(items)
Generate classes from items by taking the sorted unique values.
Optional steps
Add transforms
Transforms differ from processors in that they are applied on the fly when we grab one item. In the case of random transforms, they may also give a different result each time we ask for the same item.
transform
transform(tfms:Optional[Tuple[Union[Callable, Collection[Callable]], Union[Callable, Collection[Callable]]]]=(None, None), **kwargs)
Set tfms to be applied to the xs of the train and validation set.
This is primarily for the vision application. The kwargs arguments are the ones expected by the type of transforms you pass. tfm_y is among them; if set to True, the transforms will be applied to both input and target.
For examples see: vision.transforms.
Add a test set
To add a test set, you can use one of the two following methods.
add_test
add_test(items:Iterator[T_co], label:Any=None, tfms=None, tfm_y=None)
Add test set containing items
with an arbitrary label
.
Note: Here items
can be an ItemList
or a collection.
add_test_folder
add_test_folder(test_folder:str='test', label:Any=None, tfms=None, tfm_y=None)
Add test set containing items from test_folder
and an arbitrary label
.
Warning: In fastai the test set is unlabeled! No labels will be collected even if they are available.
Instead, either the passed label
argument or an empty label will be used for all entries of this dataset (this is required by the internal pipeline of fastai).
In the fastai framework, test datasets have no labels: this is the unknown data to be predicted. If you want to validate your model on a test dataset with labels, you probably need to use it as a validation set, as in:
data_test = (ImageList.from_folder(path)
        .split_by_folder(train='train', valid='test')
        .label_from_folder()
        ...)
Another approach is to train with a normal validation set and then, once training is over, validate on the labeled test set by loading it as a validation set:
from fastai.vision import *

tfms = []
path = Path('data').resolve()
# train with a random train/validation split
data = (ImageList.from_folder(path)
        .split_by_rand_pct()
        .label_from_folder()
        .transform(tfms)
        .databunch()
        .normalize())
learn = cnn_learner(data, models.resnet50, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)
# now replace the validation dataset entry with the test dataset as a new validation dataset:
# everything is exactly the same, except replacing `split_by_rand_pct` w/ `split_by_folder`
# (or perhaps you were already using the latter, so simply switch to valid='test')
data_test = (ImageList.from_folder(path)
        .split_by_folder(train='train', valid='test')
        .label_from_folder()
        .transform(tfms)
        .databunch()
        .normalize())
learn.validate(data_test.valid_dl)
Of course, your data block can be totally different, this is just an example.
Step 4: convert to a DataBunch
This last step is usually pretty straightforward. You just have to include all the arguments we pass to DataBunch.create (bs, num_workers, collate_fn). The class called to create a DataBunch is set in the _bunch attribute of the inputs of the training set if you need to modify it. Normally, the various subclasses we showed before handle that for you.
databunch
databunch(path:PathOrStr=None, bs:int=64, val_bs:int=None, num_workers:int=8, dl_tfms:Optional[Collection[Callable]]=None, device:device=None, collate_fn:Callable='data_collate', no_check:bool=False, **kwargs) → DataBunch
Create a DataBunch from self. path will override self.path; kwargs are passed to DataBunch.create.
Inner classes
class LabelList
LabelList(x:ItemList, y:ItemList, tfms:Union[Callable, Collection[Callable]]=None, tfm_y:bool=False, **kwargs) :: Dataset
A list of inputs x and labels y with optional tfms.
Optionally apply tfms to y if tfm_y is True.
Behind the scenes, it takes the inputs ItemList and the labels ItemList as its properties x and y, sets the property item to None, and uses LabelList.transform to apply a list of transforms (TfmList) to x, and to y as well if tfm_y is set to True.
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY)
ll = ImageList.from_folder(path_data).split_by_folder('train', 'valid').label_from_folder()
ll.train.x, ll.train.y
(ImageList (709 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny, CategoryList (709 items)
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny)
LabelList(x=ll.train.x, y=ll.train.y)
LabelList (709 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny
export
export(fn:PathOrStr, **kwargs)
Export the minimal state and save it in fn
to load an empty version for inference.
transform_y
transform_y(tfms:Union[Callable, Collection[Callable]]=None, **kwargs)
Set tfms
to be applied to the targets only.
get_state
get_state(**kwargs)
Return the minimal state for export.
load_empty
load_empty(path:PathOrStr, fn:PathOrStr)
Load the state in fn
to create an empty LabelList
for inference.
load_state
load_state(path:PathOrStr, state:dict) → LabelList
Create a LabelList
from state
.
process
process(xp:PreProcessor=None, yp:PreProcessor=None, name:str=None, max_warn_items:int=5)
Launch the processing on self.x
and self.y
with xp
and yp
.
Behind the scenes, LabelList.process does three things:
1. asks the labels y to be processed by yp with y.process(yp);
2. if y.filter_missing_y is True, removes the missing data samples from x and y;
3. asks the inputs x to be processed by xp with x.process(xp).
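The order of these three steps can be sketched in plain Python. The function below is a hypothetical illustration, not fastai code: it assumes that a label missing from the class mapping is represented as None, and it uses lower-casing as a placeholder for whatever input processing xp would perform.

```python
def process_labels_then_inputs(xs, ys, c2i):
    """Hypothetical helper mirroring the three steps described above."""
    # 1. Process the labels first: unknown labels become None (assumption).
    ys = [c2i.get(y) for y in ys]
    # 2. Drop the samples whose label is missing, from both x and y.
    keep = [i for i, y in enumerate(ys) if y is not None]
    xs, ys = [xs[i] for i in keep], [ys[i] for i in keep]
    # 3. Process the inputs last (placeholder step: lower-case the names).
    xs = [x.lower() for x in xs]
    return xs, ys

xs, ys = process_labels_then_inputs(
    ['A.png', 'B.png', 'C.png'], ['3', '9', '7'], {'3': 0, '7': 1})
```

Processing the labels before the inputs means samples with unusable labels are removed before any work is spent on their inputs.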
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY)
sd = ImageList.from_folder(path_data).split_by_folder('train', 'valid')
sd.train = sd.train.label_from_folder(from_item_lists=True)
sd.valid = sd.valid.label_from_folder(from_item_lists=True)
sd.__class__ = LabelLists
xp,yp = sd.get_processors()
xp,yp
([], [<fastai.data_block.CategoryProcessor at 0x1a23757a90>])
sd.train.process(xp, yp)
LabelList (709 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny
set_item
set_item(item)
For inference, will briefly replace the dataset with one that only contains item
.
to_df
to_df()
Create pd.DataFrame
containing items
from self.x
and self.y
.
to_csv
to_csv(dest:str)
Save self.to_df() to a CSV file in self.path/dest.
transform
transform(tfms:Union[Callable, Collection[Callable]], tfm_y:bool=None, **kwargs)
Set the tfms
and tfm_y
value to be applied to the inputs and targets.
class ItemLists
ItemLists(path:PathOrStr, train:ItemList, valid:ItemList)
An ItemList
for each of train
and valid
(optional test
).
It initializes an ItemLists object, which holds the training, validation and (optionally) testing ItemLists as its properties. It also gives helpful warning messages when the training or validation ItemList is empty.
See the following example for how to create an ItemLists
object.
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY); path_data.ls()
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/models'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train')]
il_train = ImageList.from_folder(path_data/'train')
il_valid = ImageList.from_folder(path_data/'valid')
il_test = ImageList.from_folder(path_data/'test')
ils = ItemLists(path=path_data, train=il_train, valid=il_valid); ils
ItemLists;
Train: ImageList (709 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Valid: ImageList (699 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/valid;
Test: None
ils.test = il_test; ils
ItemLists;
Train: ImageList (709 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Valid: ImageList (699 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/valid;
Test: ImageList (20 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/test
However, we are most likely to see an ItemLists right after a large ItemList is split and turned into an ItemLists by methods like ItemList.split_by_folder. Then we add labels to all training and validation items simply by calling sd.label_from_folder() (sd is an ItemLists, see the example below). Some of you may be surprised, because label_from_folder is a method of ItemList, not ItemLists. This works because ItemLists.__getattr__ forwards the call to the inner ItemLists.
The following example shows how labelling gets done by calling ItemLists.__getattr__ with ItemList.label_from_folder.
il = ImageList.from_folder(path_data); il
ImageList (1428 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny
An ItemList
or its subclass object must do a split to turn itself into an ItemLists
before doing labeling to become a LabelLists
object.
sd = il.split_by_folder(train='train', valid='valid'); sd
ItemLists;
Train: ImageList (709 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Valid: ImageList (699 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Test: None
ll = sd.label_from_folder(); ll
LabelLists;
Train: LabelList (709 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Valid: LabelList (699 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Test: None
Even when there is just an ImageList
from a training set folder with no split needed, we still must do split_none()
in order to create an ItemLists
, and only then we can do ItemLists.label_from_folder()
nicely.
il_train = ImageList.from_folder(path_data/'train')
sd_train = il_train.split_none(); sd_train
ItemLists;
Train: ImageList (709 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Valid: ImageList (0 items)
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Test: None
ll_valid_empty = sd_train.label_from_folder(); ll_valid_empty
LabelLists;
Train: LabelList (709 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Valid: LabelList (0 items)
x: ImageList
y: CategoryList
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Test: None
So in practice, although label_from_folder is not an ItemLists method, we can call ItemLists.label_from_folder() to label the training, validation and test ItemLists all at once.
Behind the scenes, ItemLists.label_from_folder() actually calls ItemLists.__getattr__('label_from_folder'), in which the training, validation and even testing ItemLists each call label_from_folder; it then turns the ItemLists into a LabelLists and finally calls LabelLists.process.
You can use ItemLists.__getattr__ directly to do the labelling, as below.
ld_inner = sd.__getattr__('label_from_folder'); ld_inner()
LabelLists;
Train: LabelList (709 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Valid: LabelList (699 items)
x: ImageList
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
y: CategoryList
7,7,7,7,7
Path: /Users/Natsume/.fastai/data/mnist_tiny;
Test: None
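The delegation pattern behind this can be sketched with a few lines of plain Python. Items and Splits are hypothetical stand-ins for ItemList and ItemLists, and the sketch is simplified: it only forwards to the train and valid parts and returns a dictionary rather than changing its own class.

```python
class Items:
    """Hypothetical stand-in for an ItemList."""
    def __init__(self, names):
        self.names = list(names)

    def label_with(self, func):
        # Pair every item with the label computed by `func`.
        return [(n, func(n)) for n in self.names]

class Splits:
    """Hypothetical stand-in for ItemLists: holds a train and a valid part."""
    def __init__(self, train, valid):
        self.train, self.valid = train, valid

    def __getattr__(self, k):
        # Only reached when normal attribute lookup on Splits fails.
        ft = getattr(self.train, k)
        if not callable(ft):
            return ft
        fv = getattr(self.valid, k)

        def _inner(*args, **kwargs):
            # Forward the same call to both inner lists.
            return {'train': ft(*args, **kwargs), 'valid': fv(*args, **kwargs)}
        return _inner

sd = Splits(Items(['3/a.png', '7/b.png']), Items(['7/c.png']))
ll = sd.label_with(lambda n: n.split('/')[0])   # label by parent folder name
```

Because __getattr__ is consulted only for attributes that Splits does not define itself, any method of the inner lists becomes callable on the container without being re-declared.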
label_from_lists
label_from_lists(train_labels:Iterator[T_co], valid_labels:Iterator[T_co], label_cls:Callable=None, **kwargs) → LabelList
Use the labels in train_labels
and valid_labels
to label the data. label_cls
will overwrite the default.
transform
transform(tfms:Optional[Tuple[Union[Callable, Collection[Callable]], Union[Callable, Collection[Callable]]]]=(None, None), **kwargs)
Set tfms
to be applied to the xs of the train and validation set.
transform_y
transform_y(tfms:Optional[Tuple[Union[Callable, Collection[Callable]], Union[Callable, Collection[Callable]]]]=(None, None), **kwargs)
Set tfms
to be applied to the ys of the train and validation set.
class LabelLists
LabelLists(path:PathOrStr, train:ItemList, valid:ItemList) :: ItemLists
A LabelList
for each of train
and valid
(optional test
).
A LabelLists object is created in exactly the same way as an ItemLists object, because its base class is ItemLists and it does not override ItemLists.__init__. The example below shows how to build a LabelLists object.
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY); path_data.ls()
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/models'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train')]
il_train = ImageList.from_folder(path_data/'train')
il_valid = ImageList.from_folder(path_data/'valid')
ll_test = LabelLists(path_data, il_train, il_valid)
ll_test.test = il_test = ImageList.from_folder(path_data/'test')
ll_test
LabelLists;
Train: ImageList (709 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/train;
Valid: ImageList (699 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/valid;
Test: ImageList (20 items)
Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)
Path: /Users/Natsume/.fastai/data/mnist_tiny/test
get_processors
get_processors()
Read the default class processors if none have been set.
Behind the scenes, LabelLists.get_processors()
first puts train.x._processor
classes and train.y._processor
classes into separate lists, and then instantiates those processors and puts them into xp
and yp
.
from fastai.vision import *
path_data = untar_data(URLs.MNIST_TINY)
sd = ImageList.from_folder(path_data).split_by_folder('train', 'valid')
sd.train = sd.train.label_from_folder(from_item_lists=True)
sd.valid = sd.valid.label_from_folder(from_item_lists=True)
sd.__class__ = LabelLists
xp,yp = sd.get_processors()
xp,yp
load_empty
load_empty(path:PathOrStr, fn:PathOrStr='export.pkl')
Create a LabelLists
with empty sets from the serialized file in path/fn
.
load_state
load_state(path:PathOrStr, state:dict)
Create a LabelLists
with empty sets from the serialized state
.
process
process()
Process the inner datasets.
process
process(processor:Union[PreProcessor, Collection[PreProcessor]]=None)
Apply processor or self.processor to self.
processor is one or more PreProcessor objects.
Behind the scenes, we put all of the processors into a list and apply each of them to an ItemList object or one of its subclasses.
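As a rough illustration of "put the processors into a list and apply them all", here is a sketch in plain Python. TinyList, Strip, Lower and this listify helper are hypothetical names written for this example; they only imitate the shape of the behavior described above.

```python
def listify(p):
    """Wrap a single object in a list; pass lists and tuples through; None -> []."""
    if p is None:
        return []
    return list(p) if isinstance(p, (list, tuple)) else [p]

class Strip:
    """Hypothetical processor: trims whitespace from every item."""
    def process(self, ds):
        ds.items = [o.strip() for o in ds.items]

class Lower:
    """Hypothetical processor: lower-cases every item."""
    def process(self, ds):
        ds.items = [o.lower() for o in ds.items]

class TinyList:
    """Hypothetical stand-in for an ItemList with a `processor` attribute."""
    def __init__(self, items, processor=None):
        self.items, self.processor = list(items), processor

    def process(self, processor=None):
        # An explicit argument overrides the stored processor(s).
        if processor is not None:
            self.processor = processor
        self.processor = listify(self.processor)
        for p in self.processor:
            p.process(self)      # each processor updates the list in place
        return self

single = TinyList([' A ', 'b ']).process(Strip()).items
chained = TinyList([' A ', 'b ']).process([Strip(), Lower()]).items
```

Normalizing the argument to a list lets callers pass either one processor or several without a separate code path.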
Helper functions
get_files
get_files(path:PathOrStr, extensions:StrList=None, recurse:bool=False, exclude:OptStrList=None, include:OptStrList=None, presort:bool=False, followlinks:bool=False) → FilePathList
Return list of files in path
that have a suffix in extensions
; optionally recurse
.
To be more precise, this function returns a list of FilePath objects for the files in path that have a suffix in extensions; hidden folders and files are ignored. If recurse=True, files in all subfolders are included; include restricts the search to particular subfolders.
Inside get_files, there is _get_files, which turns all filenames in f from directory parent/p into a list of FilePath objects. All filenames must have a suffix in extensions. All hidden files are ignored.
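The filtering rules above can be approximated with the standard library alone. list_files below is a hypothetical, simplified look-alike written for illustration (it is not the fastai source and ignores exclude, presort and followlinks); the temporary folder layout is made up to exercise the rules.

```python
import os
import tempfile
from pathlib import Path

def list_files(path, extensions=None, recurse=False, include=None):
    """Hypothetical, simplified look-alike of get_files."""
    path = Path(path)
    exts = {e.lower() for e in extensions} if extensions else None

    def _keep(parent, names):
        # Skip hidden files and files whose suffix is not wanted.
        return [Path(parent)/n for n in names
                if not n.startswith('.')
                and (exts is None or Path(n).suffix.lower() in exts)]

    if not recurse:
        return _keep(path, [e.name for e in os.scandir(path) if e.is_file()])
    res = []
    for i, (parent, dirs, files) in enumerate(os.walk(path)):
        if include is not None and i == 0:
            dirs[:] = [d for d in dirs if d in include]          # top level only
        else:
            dirs[:] = [d for d in dirs if not d.startswith('.')]  # prune hidden
        res += _keep(parent, files)
    return res

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root/'train'/'3').mkdir(parents=True)
    (root/'test').mkdir()
    for f in ['labels.csv', '.hidden.csv', 'train/3/a.png', 'test/b.png']:
        (root/f).touch()
    top = sorted(p.name for p in list_files(root))
    pngs = sorted(p.name for p in list_files(root, ['.png'], recurse=True))
    only_test = [p.name for p in list_files(root, ['.png'], recurse=True,
                                            include=['test'])]
```

Assigning to dirs[:] inside os.walk prunes the traversal in place, which is how the sketch avoids descending into hidden or non-included folders.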
path_data = untar_data(URLs.MNIST_TINY)
path_data.ls()
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/models'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train')]
With recurse=False
, no subfolder files are made available.
list_FilePath_noRecurse = get_files(path_data)
list_FilePath_noRecurse
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv')]
With recurse=True
, all subfolder files are made available, except hidden files.
list_FilePath_recurse = get_files(path_data, recurse=True)
list_FilePath_recurse[:3]
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid/7/9294.png')]
list_FilePath_recurse[-2:]
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train/3/7263.png'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train/3/7288.png')]
With extensions=['.csv']
, only files with the suffix of .csv
are made available.
list_FilePath_recurse_csv = get_files(path_data, recurse=True, extensions=['.csv'])
list_FilePath_recurse_csv
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv')]
With include=['test']
, only files in path_data
and its subfolder test
are made available.
list_FilePath_include = get_files(path_data, recurse=True, extensions=['.png','.jpg','.jpeg'],
include=['test'])
list_FilePath_include[:3]
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/4605.png'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/617.png'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/205.png')]
list_FilePath_include[-3:]
[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/1605.png'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/2642.png'),
PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/5071.png')]
©2021 fast.ai. All rights reserved.
Site last generated: Jan 5, 2021