External data
Helper functions to download the fastai datasets
Here is a complete list of the datasets that are available by default inside the library:
Main datasets:
- ADULT_SAMPLE: A small sample of the adults dataset to predict whether income exceeds $50K/yr based on census data.
- BIWI_SAMPLE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
- CIFAR: The famous cifar-10 dataset which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
- COCO_SAMPLE: A sample of the coco dataset for object detection.
- COCO_TINY: A tiny version of the coco dataset for object detection.
- HUMAN_NUMBERS: A synthetic dataset consisting of human number counts in text such as one, two, three, four.. Useful for experimenting with Language Models.
- IMDB: The full IMDB sentiment analysis dataset.
- IMDB_SAMPLE: A sample of the full IMDB sentiment analysis dataset.
- ML_SAMPLE: A movielens sample dataset for recommendation engines to recommend movies to users.
- ML_100k: The movielens 100k dataset for recommendation engines to recommend movies to users.
- MNIST_SAMPLE: A sample of the famous MNIST dataset consisting of handwritten digits.
- MNIST_TINY: A tiny version of the famous MNIST dataset consisting of handwritten digits.
- MNIST_VAR_SIZE_TINY:
- PLANET_SAMPLE: A sample of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space.
- PLANET_TINY: A tiny version of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space for faster experimentation and prototyping.
- IMAGENETTE: A smaller version of the imagenet dataset pronounced just like ‘Imagenet’, except with a corny inauthentic French accent.
- IMAGENETTE_160: The 160px version of the Imagenette dataset.
- IMAGENETTE_320: The 320px version of the Imagenette dataset.
- IMAGEWOOF: Imagewoof is a subset of 10 classes from Imagenet that aren’t so easy to classify, since they’re all dog breeds.
- IMAGEWOOF_160: 160px version of the ImageWoof dataset.
- IMAGEWOOF_320: 320px version of the ImageWoof dataset.
- IMAGEWANG: Imagewang contains Imagenette and Imagewoof combined, but with some twists that make it into a tricky semi-supervised unbalanced classification problem.
- IMAGEWANG_160: 160px version of Imagewang.
- IMAGEWANG_320: 320px version of Imagewang.
Kaggle competition datasets:
- DOGS: Image dataset consisting of dogs and cats images from Dogs vs Cats kaggle competition.
Image Classification datasets:
- CALTECH_101: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc ‘Aurelio Ranzato.
- CARS: The Cars dataset contains 16,185 images of 196 classes of cars.
- CIFAR_100: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class.
- CUB_200_2011: Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations
- FLOWERS: 17 category flower dataset by gathering images from various websites.
- FOOD:
- MNIST: MNIST dataset consisting of handwritten digits.
- PETS: A 37 category pet dataset with roughly 200 images for each class.
NLP datasets:
- AG_NEWS: The AG News corpus consists of news articles from the AG’s corpus of news articles on the web pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.
- AMAZON_REVIEWS: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
- AMAZON_REVIEWS_POLARITY: Amazon reviews dataset for sentiment analysis.
- DBPEDIA: The DBpedia ontology dataset contains 560,000 training samples and 70,000 testing samples for each of 14 nonoverlapping classes from DBpedia.
- MT_ENG_FRA: Machine translation dataset from English to French.
- SOGOU_NEWS: The Sogou-SRR (Search Result Relevance) dataset was constructed to support researches on search engine relevance estimation and ranking tasks.
- WIKITEXT: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
- WIKITEXT_TINY: A tiny version of the WIKITEXT dataset.
- YAHOO_ANSWERS: YAHOO’s question answers dataset.
- YELP_REVIEWS: The Yelp dataset is a subset of YELP businesses, reviews, and user data for use in personal, educational, and academic purposes
- YELP_REVIEWS_POLARITY: For sentiment classification on YELP reviews.
Image localization datasets:
- BIWI_HEAD_POSE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
- CAMVID: Consists of driving labelled dataset for segmentation type models.
- CAMVID_TINY: A tiny camvid dataset for segmentation type models.
- LSUN_BEDROOMS: Large-scale Image Dataset using Deep Learning with Humans in the Loop
- PASCAL_2007: Pascal 2007 dataset to recognize objects from a number of visual object classes in realistic scenes.
- PASCAL_2012: Pascal 2012 dataset to recognize objects from a number of visual object classes in realistic scenes.
Audio classification:
- MACAQUES: 7285 macaque coo calls across 8 individuals from Distributed acoustic cues for caller identity in macaque vocalization.
- ZEBRA_FINCH: 3405 zebra finch calls classified across 11 call types. Additional labels include name of individual making the vocalization and its age.
Medical imaging datasets:
- SIIM_SMALL: A smaller version of the SIIM dataset where the objective is to classify pneumothorax from a set of chest radiographic images.
- TCGA_SMALL: A smaller version of the TCGA-OV dataset with subcutaneous and visceral fat segmentations.
Citations:
Holback, C., Jarosz, R., Prior, F., Mutch, D. G., Bhosale, P., Garcia, K., … Erickson, B. J. (2016). Radiology Data from The Cancer Genome Atlas Ovarian Cancer [TCGA-OV] collection. The Cancer Imaging Archive. http://doi.org/10.7937/K9/TCIA.2016.NDO1MDFQ
Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057. https://link.springer.com/article/10.1007/s10278-013-9622-7
Pretrained models:
- OPENAI_TRANSFORMER: The GPT2 Transformer pretrained weights.
- WT103_FWD: The WikiText-103 forward language model weights.
- WT103_BWD: The WikiText-103 backward language model weights.
To download any of the datasets or pretrained weights, simply run untar_data by passing any dataset name mentioned above like so:

```python
path = untar_data(URLs.PETS)
path.ls()
```

> (#7393) [Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'),...]
To download model pretrained weights:

```python
path = untar_data(URLs.WT103_BWD)
path.ls()
```

> (#2) [Path('/home/ubuntu/.fastai/data/wt103-bwd/itos_wt103.pkl'),Path('/home/ubuntu/.fastai/data/wt103-bwd/lstm_bwd.pth')]
Config

[source]

Config(cfg_name='settings.ini')

Reading and writing settings.ini

If a config file doesn't exist already, it is created at ~/.fastai/config.yml by default whenever an instance of the Config class is created. Here is a quick example to explain:
```python
config_file = Path("~/.fastai/config.yml").expanduser()
if config_file.exists(): os.remove(config_file)
assert not config_file.exists()

config = Config()
assert config_file.exists()
```
The config is now available as `config.d`:

```python
config.d
```

{'archive_path': '/home/jhoward/.fastai/archive',
 'data_path': '/home/jhoward/.fastai/data',
 'model_path': '/home/jhoward/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}
As can be seen, this is a basic config file that consists of `data_path`, `model_path`, `storage_path` and `archive_path`. All future downloads occur at the paths defined in the config file based on the type of download. For example, all future fastai datasets are downloaded to the `data_path` while all pretrained model weights are downloaded to `model_path`, unless the default download location is updated.
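The routing described above can be sketched as a small lookup: pick the configured path by the kind of download, then append the file name. This is only an illustration of the idea (`download_dest` is a hypothetical helper, not part of fastai; the path values are the defaults shown above):

```python
from pathlib import Path

# Default config values, as shown in config.d above.
cfg = {'archive_path': '~/.fastai/archive',
       'data_path':    '~/.fastai/data',
       'model_path':   '~/.fastai/models'}

def download_dest(kind, fname, cfg=cfg):
    # kind is 'data' for datasets, 'model' for pretrained weights,
    # 'archive' for raw downloaded archives.
    return Path(cfg[f'{kind}_path']).expanduser()/fname

dataset_path = download_dest('data', 'mnist_sample.tgz')
weights_path = download_dest('model', 'lstm_bwd.pth')
```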
Please note that it is possible to update the default path locations in the config file. Let's first create a backup of the config file, update the config to show the changes, and then restore the original config from the backup.
```python
config_bak = config_file.with_suffix('.bak')  # backup location; its definition is not shown in the original listing
if config_file.exists(): shutil.move(config_file, config_bak)

config['archive_path'] = Path(".")
config.save()

config = Config()
config.d
```
{'archive_path': '.',
'data_archive_path': '/home/jhoward/.fastai/data',
'data_path': '/home/jhoward/.fastai/data',
'model_path': '/home/jhoward/.fastai/models',
'storage_path': '/tmp',
'version': 2}
The `archive_path` has been updated to `"."`. Now let's undo the changes we made to the config file for the purpose of this example.
```python
if config_bak.exists(): shutil.move(config_bak, config_file)

config = Config()
config.d
```
{'archive_path': '/home/jhoward/.fastai/archive',
'data_archive_path': '/home/jhoward/.fastai/data',
'data_path': '/home/jhoward/.fastai/data',
'model_path': '/home/jhoward/.fastai/models',
'storage_path': '/tmp',
'version': 2}
class URLs

[source]

URLs()

Global constants for dataset and model URLs.

The default local path is at ~/.fastai/archive/ but this can be updated by passing a different `c_key`. Note: `c_key` should be one of 'archive_path', 'data_archive_path', 'data_path', 'model_path', 'storage_path'.
```python
url = URLs.PETS
local_path = URLs.path(url)
test_eq(local_path.parent, Config()['archive'])
local_path
```

Path('/home/jhoward/.fastai/archive/oxford-iiit-pet.tgz')

```python
local_path = URLs.path(url, c_key='model')
test_eq(local_path.parent, Config()['model'])
local_path
```

Path('/home/jhoward/.fastai/models/oxford-iiit-pet.tgz')
Downloading

download_url

[source]

download_url(url, dest, overwrite=False, pbar=None, show_progress=True, chunk_size=1048576, timeout=4, retries=5)

Download `url` to `dest` unless it exists and not `overwrite`.
The `download_url` is a very handy function inside fastai! It can be used to download any file from the internet to the location passed in the `dest` argument. Don't be misled into thinking it can only download fastai files; it works with any URL. As an example, let's download an image from the web:
```python
fname = Path("./dog.jpg")
if fname.exists(): os.remove(fname)
url = "https://i.insider.com/569fdd9ac08a80bd448b7138?width=1100&format=jpeg&auto=webp"
download_url(url, fname)
assert fname.exists()
```
Let’s confirm that the file was indeed downloaded correctly.
```python
from PIL import Image
im = Image.open(fname)
plt.imshow(im);
```
As can be seen, the file has been downloaded to the local path provided in the `dest` argument. Calling the function again doesn't trigger a download since the file is already there. We can confirm this by checking that the file's last modified time doesn't change:
```python
if fname.exists(): last_modified_time = os.path.getmtime(fname)
download_url(url, fname)
test_eq(os.path.getmtime(fname), last_modified_time)

if fname.exists(): os.remove(fname)
```
We can also use the `download_url` function to download the pets dataset straight from the source by simply passing https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz as the `url`.
download_data

[source]

download_data(url, fname=None, c_key='archive', force_download=False, timeout=4)

Download `url` to `fname`.
The `download_data` is a convenience wrapper around `download_url` that downloads fastai files to the appropriate local path based on the `c_key`.

If `fname` is None, it will default to the archive folder you have in your config file (or data, model if you specify a different `c_key`) followed by the last part of the url: for instance `URLs.MNIST_SAMPLE` is http://files.fast.ai/data/examples/mnist_sample.tgz and the default value for `fname` will be ~/.fastai/archive/mnist_sample.tgz.
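That default can be reproduced with plain pathlib: take the last segment of the url and join it onto the archive folder. A minimal sketch of the behaviour described above (`default_archive_fname` is a hypothetical helper, not fastai's implementation):

```python
from pathlib import Path

def default_archive_fname(url, archive_dir=None):
    # Join the last part of the URL onto the archive folder
    # (~/.fastai/archive is the documented default).
    archive_dir = Path(archive_dir or Path.home()/".fastai"/"archive")
    return archive_dir/url.rpartition('/')[2]

fname = default_archive_fname("http://files.fast.ai/data/examples/mnist_sample.tgz")
fname.name  # 'mnist_sample.tgz'
```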
If `force_download=True`, the file is always downloaded. Otherwise, the download is only triggered when the file doesn't exist.
Extract

file_extract

[source]

file_extract(fname, dest=None)

Extract `fname` to `dest` using tarfile or zipfile.

`file_extract` is used by default in `untar_data` to decompress the downloaded file.
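The dispatch between tarfile and zipfile can be approximated with the standard library. The sketch below mirrors the behaviour described above, not fastai's exact implementation:

```python
import tarfile, zipfile
from pathlib import Path

def extract_archive(fname, dest=None):
    # Extract next to the archive by default, picking the extractor
    # from the file contents rather than the extension.
    fname = Path(fname)
    dest = Path(dest or fname.parent)
    if zipfile.is_zipfile(fname):
        with zipfile.ZipFile(fname) as zf: zf.extractall(dest)
    elif tarfile.is_tarfile(fname):
        with tarfile.open(fname) as tf: tf.extractall(dest)
    else:
        raise ValueError(f"Unrecognized archive: {fname}")
```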
newest_folder

[source]

newest_folder(path)

Return newest folder on `path`.
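"Newest" here means most recently modified. A sketch of that behaviour using only pathlib (assuming ties are broken arbitrarily and plain files are ignored):

```python
from pathlib import Path

def newest_subfolder(path):
    # Return the most recently modified directory directly under path.
    folders = [p for p in Path(path).iterdir() if p.is_dir()]
    return max(folders, key=lambda p: p.stat().st_mtime, default=None)
```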
rename_extracted

[source]

rename_extracted(dest)

Rename file if different from `dest`.

This renames the extracted data when its folder name differs from `dest`.
untar_data

[source]

untar_data(url, fname=None, dest=None, c_key='data', force_download=False, extract_func=file_extract, timeout=4)

Download `url` to `fname` if `dest` doesn't exist, and un-tgz or unzip to folder `dest`.
`untar_data` is a very powerful convenience function to download files from `url` to `dest`. The `url` can be a default url from the `URLs` class or a custom url. If `dest` is not passed, files are downloaded to the `default_dest`, which defaults to ~/.fastai/data/.

This convenience function extracts the downloaded files to `dest` by default. To simply download the files without extracting them, pass the `noop` function as `extract_func`.
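Any callable with the same shape as `file_extract` will do; a do-nothing extractor looks like this (`skip_extract` is a hypothetical stand-in for fastai's `noop`):

```python
def skip_extract(fname, dest=None):
    # Accept the same arguments as file_extract but do nothing,
    # so untar_data only downloads the archive.
    pass

# Hypothetical usage (assumes fastai is installed):
# path = untar_data(URLs.MNIST_SAMPLE, extract_func=skip_extract)
```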
Note, it is also possible to pass a custom `extract_func` to `untar_data` if the filetype doesn't end with .tgz or .zip. The gzip and zip files are supported by default and there is no need to pass a custom `extract_func` for these types of files.
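A custom `extract_func` only needs to accept `(fname, dest)`. As an illustration, here is a hypothetical extractor for a single gzip-compressed file (a plain .gz that is not a tar archive), built on the standard library:

```python
import gzip, shutil
from pathlib import Path

def gunzip_extract(fname, dest=None):
    # Decompress a single .gz file into dest (or next to the source),
    # dropping the .gz suffix; returns the decompressed file's path.
    fname = Path(fname)
    out = Path(dest or fname.parent)/fname.stem
    with gzip.open(fname, 'rb') as fin, open(out, 'wb') as fout:
        shutil.copyfileobj(fin, fout)
    return out
```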
Internally, if the files are not already available at the `fname` location (which defaults to ~/.fastai/archive/), they get downloaded to ~/.fastai/archive and are then extracted at the `dest` location. If no `dest` is passed, the `default_dest` to download the files is ~/.fastai/data. If the files are already available at the `fname` location but not at `dest`, then a symbolic link is created for each file from the `fname` location to `dest`.

Also, if `force_download` is set to True, files are re-downloaded even if they exist.
```python
from tempfile import TemporaryDirectory

test_eq(untar_data(URLs.MNIST_SAMPLE), config.data/'mnist_sample')

with TemporaryDirectory() as d:
    d = Path(d)
    dest = untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)
    assert Path('mnist_tiny.tgz').exists()
    assert (d/'mnist_tiny').exists()
    os.unlink('mnist_tiny.tgz')

# Test c_key
tst_model = config.model/'mnist_sample'
test_eq(untar_data(URLs.MNIST_SAMPLE, c_key='model'), tst_model)
assert not tst_model.with_suffix('.tgz').exists()  # Archive wasn't downloaded in the models path
assert (config.archive/'mnist_sample.tgz').exists()  # Archive was downloaded there
shutil.rmtree(tst_model)
```
Sometimes the extracted folder does not have the same name as the downloaded file.
```python
with TemporaryDirectory() as d:
    d = Path(d)
    untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)
    Path('mnist_tiny.tgz').rename('nims_tini.tgz')
    p = Path('nims_tini.tgz')
    dest = Path('nims_tini')
    assert p.exists()
    file_extract(p, dest.parent)
    rename_extracted(dest)
    p.unlink()
    shutil.rmtree(dest)
```
©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021