basic_data
Basic classes to contain the data for model training.
Get your data ready for training
This module defines the basic DataBunch object that is used inside Learner to train a model. This is the generic class that can take any kind of fastai Dataset or DataLoader. You'll find helpful functions in the data module of every application to directly create this DataBunch for you.
class DataBunch [source][test]

DataBunch(train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:Optional[DataLoader]=None, device:device=None, dl_tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate', no_check:bool=False)

Tests found for DataBunch:
pytest -sv tests/test_data_block.py::test_custom_dataset [source]

Some other tests where DataBunch is used:

pytest -sv tests/test_basic_data.py::test_DataBunch_Create [source]
pytest -sv tests/test_basic_data.py::test_DataBunch_no_valid_dl [source]
pytest -sv tests/test_basic_data.py::test_DataBunch_save_load [source]
To run tests please refer to this guide.
Bind train_dl, valid_dl and test_dl in a data object.

It also ensures all the dataloaders are on device and applies dl_tfms to them as batches are drawn (like normalization). path is used internally to store temporary files; collate_fn is passed to the pytorch DataLoader (replacing the one there) to explain how to collate the samples picked for a batch. By default, it grabs the data attribute of the objects sent (see vision.image or the data block API for why this can be important).
train_dl, valid_dl and optionally test_dl will be wrapped in DeviceDataLoader.
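For illustration, a minimal sketch of wrapping two existing pytorch dataloaders (my_train_ds and my_valid_ds are hypothetical pytorch Datasets, not part of the library):

from torch.utils.data import DataLoader
from fastai.basic_data import DataBunch

train_dl = DataLoader(my_train_ds, batch_size=64, shuffle=True)  # hypothetical Dataset
valid_dl = DataLoader(my_valid_ds, batch_size=128)               # hypothetical Dataset
data = DataBunch(train_dl, valid_dl)  # both get wrapped in DeviceDataLoader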
Factory method
create [source][test]

create(train_ds:Dataset, valid_ds:Dataset, test_ds:Optional[Dataset]=None, path:PathOrStr='.', bs:int=64, val_bs:int=None, num_workers:int=8, dl_tfms:Optional[Collection[Callable]]=None, device:device=None, collate_fn:Callable='data_collate', no_check:bool=False, **dl_kwargs) → DataBunch

Tests found for create:

pytest -sv tests/test_basic_data.py::test_DataBunch_Create [source]
pytest -sv tests/test_basic_data.py::test_DataBunch_no_valid_dl [source]

Some other tests where create is used:

pytest -sv tests/test_basic_data.py::test_DeviceDataLoader_getitem [source]
To run tests please refer to this guide.
Create a DataBunch from train_ds, valid_ds and maybe test_ds with a batch size of bs. Passes **dl_kwargs to the pytorch DataLoader(). num_workers is the number of CPUs to use; dl_tfms, device and collate_fn are passed to the init method.
Warning: You can pass regular pytorch Datasets here, but they'll require more attributes than the basic ones to work with the library. See below for more details.
Visualization
show_batch [source][test]

show_batch(rows:int=5, ds_type:DatasetType=<DatasetType.Train: 1>, reverse:bool=False, **kwargs)

Tests found for show_batch:

pytest -sv tests/test_basic_data.py::test_DataBunch_show_batch [source]
To run tests please refer to this guide.
Show a batch of data in ds_type on a few rows.
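A hypothetical usage, assuming data is a DataBunch from an application that knows how to display its items (vision, text, etc.):

from fastai.basic_data import DatasetType

data.show_batch(rows=3, ds_type=DatasetType.Valid)  # 3 rows from the validation set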
Grabbing some data
dl [source][test]

dl(ds_type:DatasetType=<DatasetType.Valid: 2>) → DeviceDataLoader

No tests found for dl. To contribute a test please refer to this guide and this discussion.
Returns an appropriate DataLoader with a dataset for validation, training, or test (ds_type).
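For example (a sketch assuming data is an existing DataBunch):

valid_dl = data.dl(DatasetType.Valid)  # a DeviceDataLoader over the validation set
for xb, yb in valid_dl:                # batches already moved to data.device
    break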
one_batch [source][test]

one_batch(ds_type:DatasetType=<DatasetType.Train: 1>, detach:bool=True, denorm:bool=True, cpu:bool=True) → Collection[Tensor]

Tests found for one_batch:

pytest -sv tests/test_basic_data.py::test_DataBunch_onebatch [source]
pytest -sv tests/test_basic_data.py::test_DataBunch_save_load [source]
pytest -sv tests/test_text_data.py::test_backwards_cls_databunch [source]
pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_1 [source]
pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_2 [source]
To run tests please refer to this guide.
Get one batch from the data loader of ds_type. Optionally detach and denorm.
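A sketch of typical usage (assuming data is an existing DataBunch):

x, y = data.one_batch(ds_type=DatasetType.Valid)  # detached, denormalized, on the cpu by default
print(x.shape, y.shape)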
one_item [source][test]

one_item(item, detach:bool=False, denorm:bool=False, cpu:bool=False)

Tests found for one_item:

pytest -sv tests/test_basic_data.py::test_DataBunch_oneitem [source]
To run tests please refer to this guide.
Get item into a batch. Optionally detach and denorm.
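For example (my_item is a hypothetical single item compatible with the dataset):

xb, yb = data.one_item(my_item)  # a batch of size 1 built from my_item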
sanity_check [source][test]

sanity_check()

No tests found for sanity_check. To contribute a test please refer to this guide and this discussion.

Check that the underlying data in the training set can be properly loaded.
Load and save
You can save your DataBunch object for future use with this method.
save [source][test]

save(file:PathLikeOrBinaryStream='data_save.pkl')

Tests found for save:

pytest -sv tests/test_basic_data.py::test_DataBunch_save_load [source]
To run tests please refer to this guide.
Save the DataBunch in self.path/file. file can be file-like (file or buffer).
load_data [source][test]

load_data(path:PathOrStr, file:PathLikeOrBinaryStream='data_save.pkl', bs:int=64, val_bs:int=None, num_workers:int=8, dl_tfms:Optional[Collection[Callable]]=None, device:device=None, collate_fn:Callable='data_collate', no_check:bool=False, **kwargs) → DataBunch

Tests found for load_data:

pytest -sv tests/test_basic_data.py::test_DataBunch_save_load [source]
pytest -sv tests/test_text_data.py::test_load_and_save_test [source]
To run tests please refer to this guide.
Load a saved DataBunch from path/file. file can be file-like (file or buffer).
Important: The arguments you passed when you created your first DataBunch aren't saved, so you should pass them here if you don't want the defaults.

Note: Data cannot be serialized on Windows and then loaded on Linux or vice versa because the Path object doesn't support this. We will find a workaround for that in v2.
This is to allow you to easily create a new DataBunch with a different batch size, for instance. You will also need to reapply any normalization (in vision) you might have done on your original DataBunch.
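A hedged round trip, assuming data is an existing DataBunch:

from fastai.basic_data import load_data

data.save('data_save.pkl')                           # written to data.path/'data_save.pkl'
data = load_data(data.path, 'data_save.pkl', bs=32)  # reload with a different batch size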
Empty DataBunch for inference
export [source][test]

export(file:PathLikeOrBinaryStream='export.pkl')

No tests found for export. To contribute a test please refer to this guide and this discussion.

Export the minimal state of self for inference in self.path/file. file can be file-like (file or buffer).
load_empty [source][test]

load_empty(path, fname:str='export.pkl')

No tests found for _databunch_load_empty. To contribute a test please refer to this guide and this discussion.

Load an empty DataBunch from the exported file in path/fname with optional tfms.
This method should be used to create a DataBunch at inference time; see the corresponding tutorial.
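A sketch of the export/load pattern (assuming data is the DataBunch used at training time):

data.export('export.pkl')                     # minimal state in data.path/'export.pkl'
empty_data = DataBunch.load_empty(data.path)  # later, at inference time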
add_test [source][test]

add_test(items:Iterator[T_co], label:Any=None, tfms=None, tfm_y=None)

No tests found for add_test. To contribute a test please refer to this guide and this discussion.
Add the items as a test set. Pass along label, otherwise they are labeled with EmptyLabel.
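For example (test_items is a hypothetical collection of unlabelled items):

data.add_test(test_items)  # available afterwards as data.test_dl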
Dataloader transforms
add_tfm [source][test]

add_tfm(tfm:Callable)

No tests found for add_tfm. To contribute a test please refer to this guide and this discussion.
Adds a transform to all dataloaders.
Using a custom Dataset in fastai
If you want to use your pytorch Dataset in fastai, you may need to implement more attributes/methods to use the full functionality of the library. Some functions can easily be used with your pytorch Dataset if you just add an attribute; for others, the best would be to create your own ItemList by following this tutorial. Here is a full list of what the library will expect.
Basics
First of all, you obviously need to implement the methods __len__ and __getitem__, as indicated by the pytorch docs. Then the most needed things would be:

- a c attribute: it's used in most functions that directly create a Learner (tabular_learner, text_classifier_learner, unet_learner, cnn_learner) and represents the number of outputs of the final layer of your model (also the number of classes if applicable).
- a classes attribute: it's used by ClassificationInterpretation and also in collab_learner (better to use CollabDataBunch.from_df than a pytorch Dataset) and represents the unique tags that appear in your data.
- maybe a loss_func attribute: it is going to be used by Learner as a default loss function, so if you know your custom Dataset requires a particular loss, you can put it there.
Toy example with image-like numpy arrays and binary label
import numpy as np
from torch.utils.data import Dataset
from fastai.basic_data import DataBunch

class ArrayDataset(Dataset):
    "Sample numpy array dataset"
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.c = 2  # binary label
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

train_x = np.random.rand(10, 3, 3)       # 10 images (3x3)
train_y = np.random.rand(10, 1).round()  # binary labels
valid_x = np.random.rand(10, 3, 3)
valid_y = np.random.rand(10, 1).round()
train_ds, valid_ds = ArrayDataset(train_x, train_y), ArrayDataset(valid_x, valid_y)
data = DataBunch.create(train_ds, valid_ds, bs=2, num_workers=1)
data.one_batch()
(tensor([[[0.8053, 0.5914, 0.5369],
          [0.6880, 0.4680, 0.5457],
          [0.0051, 0.2096, 0.3469]],

         [[0.5170, 0.2542, 0.9869],
          [0.0176, 0.5049, 0.4417],
          [0.3495, 0.7276, 0.5426]]], dtype=torch.float64),
 tensor([[0.],
         [0.]], dtype=torch.float64))
For a specific application
In text, your dataset will need to have a vocab attribute that should be an instance of Vocab. It's used by text_classifier_learner and language_model_learner when building the model.
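A minimal sketch of that requirement (the fields here are hypothetical; Vocab comes from fastai.text):

from fastai.text import Vocab
from torch.utils.data import Dataset

class MyTextDataset(Dataset):
    def __init__(self, token_ids, labels, itos):
        self.x, self.y = token_ids, labels
        self.vocab = Vocab(itos)  # itos: list mapping token ids back to strings
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]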
In tabular, your dataset will need to have a cont_names attribute (for the names of continuous variables) and a get_emb_szs method that returns a list of tuples (n_classes, emb_sz) representing, for each categorical variable, the number of different codes (don't forget to add 1 for nan) and the corresponding embedding size. Those two are used with the c attribute by tabular_learner.
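A hedged sketch of those hooks (the column names and sizes are made up):

from torch.utils.data import Dataset

class MyTabularDataset(Dataset):
    "Hypothetical sketch of the attributes tabular_learner relies on."
    def __init__(self, cats, conts, ys):
        self.cats, self.conts, self.ys = cats, conts, ys
        self.c = 2                         # number of outputs/classes
        self.cont_names = ['age', 'fare']  # hypothetical continuous columns
    def get_emb_szs(self):
        # one (n_classes, emb_sz) tuple per categorical variable,
        # where n_classes counts an extra code for nan
        return [(10, 6), (3, 3)]
    def __len__(self): return len(self.ys)
    def __getitem__(self, i): return (self.cats[i], self.conts[i]), self.ys[i]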
Functions that really won’t work
To make those last functions work, you really need to use the data block API and maybe write your own custom ItemList.
- DataBunch.show_batch (requires .x.reconstruct, .y.reconstruct and .x.show_xys)
- Learner.predict (requires x.set_item, .y.analyze_pred, .y.reconstruct and maybe .x.reconstruct)
- Learner.show_results (requires x.reconstruct, y.analyze_pred, y.reconstruct and x.show_xyzs)
- DataBunch.set_item (requires x.set_item)
- Learner.backward (uses DataBunch.set_item)
- DataBunch.export (requires export)
class DeviceDataLoader [source][test]

DeviceDataLoader(dl:DataLoader, device:device, tfms:List[Callable]=None, collate_fn:Callable='data_collate')

Tests found for DeviceDataLoader:

Some other tests where DeviceDataLoader is used:

pytest -sv tests/test_basic_data.py::test_DeviceDataLoader_getitem [source]
To run tests please refer to this guide.
Bind a DataLoader to a torch.device.
Put the batches of dl on device after applying an optional list of tfms. collate_fn will replace the one of dl. All dataloaders of a DataBunch are of this type.
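For example (train_dl is an existing pytorch DataLoader; my_tfm is a hypothetical batch transform, and the device is just for illustration):

import torch
from fastai.basic_data import DeviceDataLoader

ddl = DeviceDataLoader(train_dl, torch.device('cuda'), tfms=[my_tfm])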
Factory method
create [source][test]

create(dataset:Dataset, bs:int=64, shuffle:bool=False, device:device=device(type='cpu'), tfms:Collection[Callable]=None, num_workers:int=8, collate_fn:Callable='data_collate', **kwargs:Any)

Tests found for create:

Some other tests where create is used:

pytest -sv tests/test_basic_data.py::test_DataBunch_Create [source]
pytest -sv tests/test_basic_data.py::test_DataBunch_no_valid_dl [source]
pytest -sv tests/test_basic_data.py::test_DeviceDataLoader_getitem [source]
To run tests please refer to this guide.
Create a DeviceDataLoader from dataset with batch size bs and shuffle, processing with num_workers.

The given collate_fn will be used to put the samples together in one batch (by default it grabs their data attribute). shuffle means the dataloader will take the samples randomly if that flag is set to True, or in order otherwise. tfms are passed to the init method. All kwargs are passed to the pytorch DataLoader class initialization.
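A usage sketch (my_ds is a hypothetical pytorch Dataset):

ddl = DeviceDataLoader.create(my_ds, bs=32, shuffle=True,
                              device=torch.device('cpu'), num_workers=2)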
Methods
add_tfm [source][test]

add_tfm(tfm:Callable)

No tests found for add_tfm. To contribute a test please refer to this guide and this discussion.

Add tfm to self.tfms.
remove_tfm [source][test]

remove_tfm(tfm:Callable)

No tests found for remove_tfm. To contribute a test please refer to this guide and this discussion.

Remove tfm from self.tfms.
new [source][test]

new(**kwargs)

No tests found for new. To contribute a test please refer to this guide and this discussion.

Create a new copy of self with kwargs replacing current values.
proc_batch [source][test]

proc_batch(b:Tensor) → Tensor

No tests found for proc_batch. To contribute a test please refer to this guide and this discussion.

Process batch b of TensorImage.
DatasetType [test]

Enum = [Train, Valid, Test, Single, Fix]

No tests found for DatasetType. To contribute a test please refer to this guide and this discussion.
Internal enumerator to name the training, validation and test dataset/dataloader.