datasets
This module provides functions for downloading several useful datasets that we might want to use in our models.
class URLs
URLs()
Global constants for dataset and model URLs.
This contains all the datasets’ and models’ URLs, and some classmethods to help use them; you don’t create objects of this class. The supported datasets are (with their calling name): S3_NLP, S3_COCO, MNIST_SAMPLE, MNIST_TINY, IMDB_SAMPLE, ADULT_SAMPLE, ML_SAMPLE, PLANET_SAMPLE, CIFAR, PETS, MNIST. To get details on the datasets you can see the fast.ai datasets webpage. Datasets with SAMPLE in their name are subsets of the original datasets. In the case of MNIST, we also have a TINY dataset which is even smaller than MNIST_SAMPLE.
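As a small illustration of the naming convention above, this sketch picks out the subset datasets by name. The list simply repeats the calling names from the paragraph above; `DATASET_NAMES` and `sample_names` are names made up for this example and are not part of the library.

```python
# Calling names copied from the list of supported datasets above.
DATASET_NAMES = [
    'S3_NLP', 'S3_COCO', 'MNIST_SAMPLE', 'MNIST_TINY', 'IMDB_SAMPLE',
    'ADULT_SAMPLE', 'ML_SAMPLE', 'PLANET_SAMPLE', 'CIFAR', 'PETS', 'MNIST',
]

# Datasets with SAMPLE in their name are subsets of the original datasets.
sample_names = [name for name in DATASET_NAMES if 'SAMPLE' in name]
print(sample_names)
```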
Models are currently limited to WT103, but you can expect more in the future!
URLs.MNIST_SAMPLE
'http://files.fast.ai/data/examples/mnist_sample'
Downloading Data
For the rest of the datasets you will need to download them with untar_data or download_data. untar_data will download the data file and decompress it, while download_data will just download and save the compressed file in .tgz format.
The locations where the data and models are downloaded are set in config.yml, which by default is located in ~/.fastai. This directory can be changed via the optional environment variable FASTAI_HOME (e.g. FASTAI_HOME=/home/.fastai).
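The lookup described above can be sketched roughly as follows. `resolve_config_dir` is a hypothetical helper written for this illustration, not a fastai function; it only assumes that FASTAI_HOME, when set, replaces the ~/.fastai default.

```python
import os
from pathlib import Path


def resolve_config_dir(environ=None) -> Path:
    """Return the directory expected to hold config.yml (hypothetical helper).

    Uses FASTAI_HOME when it is set, otherwise falls back to ~/.fastai.
    """
    environ = os.environ if environ is None else environ
    home = environ.get('FASTAI_HOME', '~/.fastai')
    return Path(home).expanduser()


# Passing a dict instead of os.environ keeps the example free of side effects.
print(resolve_config_dir({'FASTAI_HOME': '/home/.fastai'}))
print(resolve_config_dir({}))
```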
If no config.yml is present in the specified directory, a default one will be created with data_archive_path, data_path and models_path entries. The data_path and models_path entries point to the data and models folders, respectively, in the same directory as config.yml. The data_archive_path entry allows you to set a separate folder in which to save compressed datasets for archiving purposes. It defaults to the same directory as data_path.
Configure those download locations by editing data_archive_path, data_path and models_path in config.yml.
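To make the default layout concrete, here is a minimal sketch of creating a config file when none exists. `default_entries` and `ensure_config` are hypothetical helpers, and the file is written as plain `key: value` lines standing in for whatever the library actually writes; only the three entry names and their default locations come from the description above.

```python
import tempfile
from pathlib import Path


def default_entries(config_dir: Path) -> dict:
    """Default entries, all relative to the directory holding config.yml."""
    data = config_dir / 'data'
    return {
        'data_archive_path': str(data),  # defaults to the same directory as data_path
        'data_path': str(data),
        'models_path': str(config_dir / 'models'),
    }


def ensure_config(config_dir: Path) -> Path:
    """Create config.yml with default entries if it is missing; return its path."""
    config_file = config_dir / 'config.yml'
    if not config_file.exists():
        config_dir.mkdir(parents=True, exist_ok=True)
        lines = [f'{key}: {value}' for key, value in default_entries(config_dir).items()]
        config_file.write_text('\n'.join(lines) + '\n')
    return config_file


# A temporary directory stands in for ~/.fastai so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    config_file = ensure_config(Path(tmp) / '.fastai')
    config_text = config_file.read_text()
    print(config_text)
```

Editing the written values afterwards is then just a matter of changing the paths on the right-hand side of each entry.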
untar_data
untar_data(url:str, fname:PathOrStr=None, dest:PathOrStr=None, data=True, force_download=False, verbose=False) → Path
Download url to fname if dest doesn’t exist, and un-tgz to folder dest.
In general, untar_data uses a url to download a tgz file under fname, and then extracts fname into a folder under dest.
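The extraction half of that flow can be tried without any network access. The sketch below uses only the standard library's tarfile module; `make_tgz` and `untar` are hypothetical helpers, and the locally built archive merely stands in for a downloaded fname. It is not how untar_data itself is implemented.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tgz(src_dir: Path, archive: Path) -> Path:
    """Pack src_dir into a gzip-compressed tar archive (stand-in for a download)."""
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(src_dir, arcname=src_dir.name)
    return archive


def untar(archive: Path, dest_parent: Path) -> Path:
    """Extract archive into dest_parent and return the extracted top-level folder."""
    dest_parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, 'r:gz') as tar:
        top_level = tar.getnames()[0].split('/')[0]
        tar.extractall(dest_parent)
    return dest_parent / top_level


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / 'src' / 'demo_sample'
    src.mkdir(parents=True)
    (src / 'labels.csv').write_text('name,label\n')

    fname = make_tgz(src, tmp / 'demo_sample.tgz')  # plays the role of fname
    dest = untar(fname, tmp / 'data')               # plays the role of dest
    extracted = sorted(p.name for p in dest.iterdir())
    print(dest.name, extracted)
```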
If you have run untar_data before, then running untar_data(URLs.something) again will just return dest without downloading again.
If you have run untar_data before, then running it again with force_download=True, or when the tgz file under fname has somehow been corrupted, will remove the existing fname and dest and start downloading again.
If you have run untar_data before but dest does not exist, meaning there is no folder under dest (it may have been removed or renamed), then running untar_data(URLs.something) again will execute download_data. In that case, if the tgz file under fname still exists, nothing is actually downloaded and fname is simply extracted into dest; if fname does not exist, the tgz file is downloaded first.
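The cases above can be summarised as a small decision function. This is only a sketch of the behaviour as described in the text, not the library's code; `plan_untar` and the step strings it returns are made up for illustration.

```python
def plan_untar(dest_exists: bool, archive_exists: bool,
               force_download: bool = False, archive_corrupted: bool = False) -> list:
    """Return the steps implied by the cases described above (illustrative only)."""
    steps = []
    if force_download or archive_corrupted:
        # The existing archive and extracted folder are discarded before starting over.
        steps.append('remove fname and dest')
        dest_exists = archive_exists = False
    if dest_exists:
        # The extracted folder is already in place, so nothing is downloaded.
        return steps + ['return dest']
    if not archive_exists:
        steps.append('download fname')
    steps += ['extract fname into dest', 'return dest']
    return steps


print(plan_untar(dest_exists=True, archive_exists=True))
print(plan_untar(dest_exists=True, archive_exists=True, force_download=True))
print(plan_untar(dest_exists=False, archive_exists=True))
print(plan_untar(dest_exists=False, archive_exists=False))
```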
Note: the url you feed to untar_data must be one of URLs.something.
untar_data(URLs.PLANET_SAMPLE)
PosixPath('/home/ubuntu/.fastai/data/planet_sample')
download_data
download_data(url:str, fname:PathOrStr=None, data:bool=True, ext:str='.tgz') → Path
Download url to destination fname.
Note: If the data file already exists in a data directory alongside the notebook, that data file will be used instead of the one present in the folder specified in config.yml. config.yml is located in the directory specified in the optional environment variable FASTAI_HOME (defaults to ~/.fastai/). Paths are resolved by calling the function datapath4file, which checks if data exists locally (data/) first, before downloading to the folder specified in config.yml.
Example:
download_data(URLs.PLANET_SAMPLE)
PosixPath('/home/ubuntu/.fastai/data/planet_sample.tgz')
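A download-only-if-missing step might look roughly like the sketch below. It is an assumption-laden illustration rather than the library's implementation: `download_if_missing` is a hypothetical helper, and appending ext to the URL is inferred from the examples on this page (the URL has no extension while the saved file ends in .tgz). A file:// URL is used so the example runs without network access.

```python
import tempfile
from pathlib import Path
from shutil import copyfileobj
from urllib.request import urlopen


def download_if_missing(url: str, fname: Path, ext: str = '.tgz') -> Path:
    """Fetch url + ext into fname unless fname already exists; return fname."""
    fname = Path(fname)
    if not fname.exists():
        fname.parent.mkdir(parents=True, exist_ok=True)
        # Assumption: the remote file lives at the URL with ext appended.
        with urlopen(url + ext) as response, open(fname, 'wb') as out:
            copyfileobj(response, out)
    return fname


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # A local file plays the role of the remote archive.
    remote = tmp / 'remote' / 'demo_sample.tgz'
    remote.parent.mkdir()
    remote.write_bytes(b'not a real archive')

    url = (tmp / 'remote' / 'demo_sample').as_uri()  # file:// URL without the extension
    saved = download_if_missing(url, tmp / 'data' / 'demo_sample.tgz')
    saved_bytes = saved.read_bytes()
    print(saved.name)
```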
datapath4file
datapath4file(filename:str, ext:str='.tgz', archive=True)
Return data path to filename, checking locally first then in the config file.
All the downloading functions use this to decide where to put the tgz and expanded folder. If filename already exists in a data directory in the same place as the calling notebook/script, that is used as the parent directory; otherwise, config.yml is read to see what path to use, which defaults to ~/.fastai/data. To override this default, simply modify the value in your config.yml:
data_archive_path: ~/.fastai/data
data_path: ~/.fastai/data
config.yml is located in the directory specified in the optional environment variable FASTAI_HOME (defaults to ~/.fastai/).
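The local-first lookup can be sketched as below. `resolve_data_path` is a hypothetical function with explicit directory arguments so that it can be run anywhere; the real datapath4file takes the local data directory and the configured path from its environment rather than from parameters.

```python
import tempfile
from pathlib import Path


def resolve_data_path(filename: str, local_data: Path, config_data: Path,
                      ext: str = '.tgz') -> Path:
    """Prefer local_data when it already holds filename (with or without ext)."""
    local = local_data / filename
    if local.exists() or local.with_name(local.name + ext).exists():
        return local
    # Otherwise fall back to the directory configured in config.yml.
    return config_data / filename


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    local_data = tmp / 'project' / 'data'  # data/ next to the notebook or script
    config_data = tmp / 'home' / 'data'    # stands in for ~/.fastai/data
    (local_data / 'demo_sample').mkdir(parents=True)

    found_locally = resolve_data_path('demo_sample', local_data, config_data)
    from_config = resolve_data_path('other_sample', local_data, config_data)
    print(found_locally.parent.parent.name, from_config.parent.parent.name)
```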
url2path
url2path(url, data=True, ext:str='.tgz')
Change url to a path.
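One plausible reading of this mapping, judging by the example outputs on this page, is that the last component of the URL becomes the file or folder name. The sketch below illustrates that idea only; `url_to_path` and the `/tmp/demo/data` base directory are hypothetical, and the real function chooses its base directory from the config.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def url_to_path(url: str, base_dir: Path, ext: str = '.tgz') -> Path:
    """Map a dataset URL to base_dir/<last URL component><ext> (illustrative only)."""
    name = PurePosixPath(urlparse(url).path).name
    return Path(base_dir) / f'{name}{ext}'


url = 'http://files.fast.ai/data/examples/mnist_sample'
archive = url_to_path(url, Path('/tmp/demo/data'))         # path of the compressed file
folder = url_to_path(url, Path('/tmp/demo/data'), ext='')  # path of the extracted folder
print(archive.name, folder.name)
```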
class Config
Config()
Creates a default config file ‘config.yml’ in $FASTAI_HOME (default ~/.fastai/).
You probably won’t need to use this yourself; it’s used by URLs.datapath4file.
get_path
get_path(path)
Get the path in the config file.
Get the key corresponding to path in the Config.
data_path
data_path()
Get the path to data in the config file.
Get the Path where the data is stored.
model_path
model_path()
Get the path to fastai pretrained models in the config file.
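To show how the three accessors relate to each other, here is a minimal stand-in that keeps the entries in memory instead of reading config.yml. `DemoConfig` is hypothetical; only the entry names and their documented default locations are taken from this page.

```python
from pathlib import Path


class DemoConfig:
    """Hypothetical stand-in for a config holder with path accessors."""

    # Entry names and default locations as documented above.
    _entries = {
        'data_archive_path': '~/.fastai/data',
        'data_path': '~/.fastai/data',
        'models_path': '~/.fastai/models',
    }

    @classmethod
    def get_path(cls, key: str) -> Path:
        """Return the entry stored under key as an expanded Path."""
        return Path(cls._entries[key]).expanduser()

    @classmethod
    def data_path(cls) -> Path:
        return cls.get_path('data_path')

    @classmethod
    def model_path(cls) -> Path:
        return cls.get_path('models_path')


print(DemoConfig.data_path())
print(DemoConfig.model_path())
```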
©2021 fast.ai. All rights reserved.
Site last generated: Jan 5, 2021