External data
Helper functions to download the fastai datasets
Here is a complete list of the datasets that are available by default inside the library:
Main datasets:
- ADULT_SAMPLE: A small sample of the adults dataset to predict whether income exceeds $50K/yr based on census data.
- BIWI_SAMPLE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
- CIFAR: The famous cifar-10 dataset which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
- COCO_SAMPLE: A sample of the coco dataset for object detection.
- COCO_TINY: A tiny version of the coco dataset for object detection.
- HUMAN_NUMBERS: A synthetic dataset consisting of human number counts in text such as one, two, three, four.. Useful for experimenting with Language Models.
- IMDB: The full IMDB sentiment analysis dataset.
- IMDB_SAMPLE: A sample of the full IMDB sentiment analysis dataset.
- ML_SAMPLE: A movielens sample dataset for recommendation engines to recommend movies to users.
- ML_100k: The movielens 100k dataset for recommendation engines to recommend movies to users.
- MNIST_SAMPLE: A sample of the famous MNIST dataset consisting of handwritten digits.
- MNIST_TINY: A tiny version of the famous MNIST dataset consisting of handwritten digits.
- MNIST_VAR_SIZE_TINY:
- PLANET_SAMPLE: A sample of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space.
- PLANET_TINY: A tiny version of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space for faster experimentation and prototyping.
- IMAGENETTE: A smaller version of the imagenet dataset pronounced just like ‘Imagenet’, except with a corny inauthentic French accent.
- IMAGENETTE_160: The 160px version of the Imagenette dataset.
- IMAGENETTE_320: The 320px version of the Imagenette dataset.
- IMAGEWOOF: Imagewoof is a subset of 10 classes from Imagenet that aren’t so easy to classify, since they’re all dog breeds.
- IMAGEWOOF_160: 160px version of the ImageWoof dataset.
- IMAGEWOOF_320: 320px version of the ImageWoof dataset.
- IMAGEWANG: Imagewang contains Imagenette and Imagewoof combined, but with some twists that make it into a tricky semi-supervised unbalanced classification problem.
- IMAGEWANG_160: 160px version of Imagewang.
- IMAGEWANG_320: 320px version of Imagewang.
Kaggle competition datasets:
- DOGS: Image dataset consisting of dogs and cats images from Dogs vs Cats kaggle competition.
Image Classification datasets:
- CALTECH_101: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc ‘Aurelio Ranzato.
- CARS: The Cars dataset contains 16,185 images of 196 classes of cars.
- CIFAR_100: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class.
- CUB_200_2011: Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations
- FLOWERS: 17 category flower dataset by gathering images from various websites.
- FOOD:
- MNIST: MNIST dataset consisting of handwritten digits.
- PETS: A 37 category pet dataset with roughly 200 images for each class.
NLP datasets:
- AG_NEWS: The AG News corpus consists of news articles from the AG’s corpus of news articles on the web pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.
- AMAZON_REVIEWS: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
- AMAZON_REVIEWS_POLARITY: Amazon reviews dataset for sentiment analysis.
- DBPEDIA: The DBpedia ontology dataset contains 560,000 training samples and 70,000 testing samples for each of 14 nonoverlapping classes from DBpedia.
- MT_ENG_FRA: Machine translation dataset from English to French.
- SOGOU_NEWS: The Sogou-SRR (Search Result Relevance) dataset was constructed to support researches on search engine relevance estimation and ranking tasks.
- WIKITEXT: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
- WIKITEXT_TINY: A tiny version of the WIKITEXT dataset.
- YAHOO_ANSWERS: YAHOO’s question answers dataset.
- YELP_REVIEWS: The Yelp dataset is a subset of YELP businesses, reviews, and user data for use in personal, educational, and academic purposes
- YELP_REVIEWS_POLARITY: For sentiment classification on YELP reviews.
Image localization datasets:
- BIWI_HEAD_POSE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
- CAMVID: Consists of driving labelled dataset for segmentation type models.
- CAMVID_TINY: A tiny camvid dataset for segmentation type models.
- LSUN_BEDROOMS: Large-scale Image Dataset using Deep Learning with Humans in the Loop
- PASCAL_2007: Pascal 2007 dataset to recognize objects from a number of visual object classes in realistic scenes.
- PASCAL_2012: Pascal 2012 dataset to recognize objects from a number of visual object classes in realistic scenes.
Audio classification:
- MACAQUES: 7285 macaque coo calls across 8 individuals from Distributed acoustic cues for caller identity in macaque vocalization.
- ZEBRA_FINCH: 3405 zebra finch calls classified across 11 call types. Additional labels include name of individual making the vocalization and its age.
Medical imaging datasets:
- SIIM_SMALL: A smaller version of the SIIM dataset where the objective is to classify pneumothorax from a set of chest radiographic images.
- TCGA_SMALL: A smaller version of the TCGA-OV dataset with subcutaneous and visceral fat segmentations.
Citations:
Holback, C., Jarosz, R., Prior, F., Mutch, D. G., Bhosale, P., Garcia, K., … Erickson, B. J. (2016). Radiology Data from The Cancer Genome Atlas Ovarian Cancer [TCGA-OV] collection. The Cancer Imaging Archive. http://doi.org/10.7937/K9/TCIA.2016.NDO1MDFQ
Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057. https://link.springer.com/article/10.1007/s10278-013-9622-7
Pretrained models:
- OPENAI_TRANSFORMER: The GPT2 Transformer pretrained weights.
- WT103_FWD: The WikiText-103 forward language model weights.
- WT103_BWD: The WikiText-103 backward language model weights.
To download any of the datasets or pretrained weights, simply run untar_data by passing any dataset name mentioned above like so:

```python
path = untar_data(URLs.PETS)
path.ls()
```

> (#7393) [Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'),...]
To download model pretrained weights:

```python
path = untar_data(URLs.WT103_BWD)
path.ls()
```

> (#2) [Path('/home/ubuntu/.fastai/data/wt103-bwd/itos_wt103.pkl'),Path('/home/ubuntu/.fastai/data/wt103-bwd/lstm_bwd.pth')]
Config

[source]

Config(cfg_name='settings.ini')

Reading and writing settings.ini

If a config file doesn't exist already, it is created at ~/.fastai/config.yml by default whenever an instance of the Config class is created. Here is a quick example to explain:
```python
config_file = Path("~/.fastai/config.yml").expanduser()
if config_file.exists(): os.remove(config_file)
assert not config_file.exists()

config = Config()
assert config_file.exists()
```
The config is now available as `config.d`:

```python
config.d
```

{'archive_path': '/home/jhoward/.fastai/archive',
 'data_path': '/home/jhoward/.fastai/data',
 'model_path': '/home/jhoward/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}
As can be seen, this is a basic config file that consists of `data_path`, `model_path`, `storage_path` and `archive_path`. All future downloads occur at the paths defined in the config file based on the type of download. For example, all future fastai datasets are downloaded to the `data_path` while all pretrained model weights are downloaded to `model_path`, unless the default download location is updated.
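The routing described above can be sketched as a small lookup: pick the configured path by the kind of download, then append the file name. This is only an illustration of the idea (`download_dest` is a hypothetical helper, not part of fastai; the path values are the defaults shown above):

```python
from pathlib import Path

# Default config values, as shown in config.d above.
cfg = {'archive_path': '~/.fastai/archive',
       'data_path':    '~/.fastai/data',
       'model_path':   '~/.fastai/models'}

def download_dest(kind, fname, cfg=cfg):
    # kind is 'data' for datasets, 'model' for pretrained weights,
    # 'archive' for raw downloaded archives.
    return Path(cfg[f'{kind}_path']).expanduser()/fname

dataset_path = download_dest('data', 'mnist_sample.tgz')
weights_path = download_dest('model', 'lstm_bwd.pth')
```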
Please note that it is possible to update the default path locations in the config file. Let's first create a backup of the config file, update the config to show the changes, and then restore the original config from the backup.
```python
config_bak = config_file.with_suffix('.bak')  # backup location; its definition is not shown in the original listing
if config_file.exists(): shutil.move(config_file, config_bak)

config['archive_path'] = Path(".")
config.save()

config = Config()
config.d
```
{'archive_path': '.',
'data_archive_path': '/home/jhoward/.fastai/data',
'data_path': '/home/jhoward/.fastai/data',
'model_path': '/home/jhoward/.fastai/models',
'storage_path': '/tmp',
'version': 2}
The `archive_path` has been updated to `"."`. Now let's undo the changes we made to the config file for the purpose of this example.
```python
if config_bak.exists(): shutil.move(config_bak, config_file)

config = Config()
config.d
```
{'archive_path': '/home/jhoward/.fastai/archive',
'data_archive_path': '/home/jhoward/.fastai/data',
'data_path': '/home/jhoward/.fastai/data',
'model_path': '/home/jhoward/.fastai/models',
'storage_path': '/tmp',
'version': 2}
class URLs

[source]

URLs()

Global constants for dataset and model URLs.

The default local path is at ~/.fastai/archive/ but this can be updated by passing a different `c_key`. Note: `c_key` should be one of 'archive_path', 'data_archive_path', 'data_path', 'model_path', 'storage_path'.
```python
url = URLs.PETS
local_path = URLs.path(url)
test_eq(local_path.parent, Config()['archive'])
local_path
```

Path('/home/jhoward/.fastai/archive/oxford-iiit-pet.tgz')

```python
local_path = URLs.path(url, c_key='model')
test_eq(local_path.parent, Config()['model'])
local_path
```

Path('/home/jhoward/.fastai/models/oxford-iiit-pet.tgz')
Downloading

download_url

[source]

download_url(url, dest, overwrite=False, pbar=None, show_progress=True, chunk_size=1048576, timeout=4, retries=5)

Download `url` to `dest` unless it exists and not `overwrite`.
The `download_url` is a very handy function inside fastai! It can be used to download any file from the internet to the location passed in the `dest` argument. Don't be misled into thinking it can only download fastai files; it works with any URL. As an example, let's download an image from the web:
```python
fname = Path("./dog.jpg")
if fname.exists(): os.remove(fname)
url = "https://i.insider.com/569fdd9ac08a80bd448b7138?width=1100&format=jpeg&auto=webp"
download_url(url, fname)
assert fname.exists()
```
Let’s confirm that the file was indeed downloaded correctly.
```python
from PIL import Image
im = Image.open(fname)
plt.imshow(im);
```
As can be seen, the file has been downloaded to the local path provided in the `dest` argument. Calling the function again doesn't trigger a download since the file is already there. We can confirm this by checking that the file's last modified time doesn't change:
```python
if fname.exists(): last_modified_time = os.path.getmtime(fname)
download_url(url, fname)
test_eq(os.path.getmtime(fname), last_modified_time)

if fname.exists(): os.remove(fname)
```
We can also use the `download_url` function to download the pets dataset straight from the source by simply passing https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz as the `url`.
download_data

[source]

download_data(url, fname=None, c_key='archive', force_download=False, timeout=4)

Download `url` to `fname`.
The `download_data` is a convenience wrapper around `download_url` that downloads fastai files to the appropriate local path based on the `c_key`.

If `fname` is None, it will default to the archive folder you have in your config file (or data, model if you specify a different `c_key`) followed by the last part of the url: for instance `URLs.MNIST_SAMPLE` is http://files.fast.ai/data/examples/mnist_sample.tgz and the default value for `fname` will be ~/.fastai/archive/mnist_sample.tgz.
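That default can be reproduced with plain pathlib: take the last segment of the url and join it onto the archive folder. A minimal sketch of the behaviour described above (`default_archive_fname` is a hypothetical helper, not fastai's implementation):

```python
from pathlib import Path

def default_archive_fname(url, archive_dir=None):
    # Join the last part of the URL onto the archive folder
    # (~/.fastai/archive is the documented default).
    archive_dir = Path(archive_dir or Path.home()/".fastai"/"archive")
    return archive_dir/url.rpartition('/')[2]

fname = default_archive_fname("http://files.fast.ai/data/examples/mnist_sample.tgz")
fname.name  # 'mnist_sample.tgz'
```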
If `force_download=True`, the file is always downloaded. Otherwise, the download is only triggered when the file doesn't exist.
Extract

file_extract

[source]

file_extract(fname, dest=None)

Extract `fname` to `dest` using tarfile or zipfile.

`file_extract` is used by default in `untar_data` to decompress the downloaded file.
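The dispatch between tarfile and zipfile can be approximated with the standard library. The sketch below mirrors the behaviour described above, not fastai's exact implementation:

```python
import tarfile, zipfile
from pathlib import Path

def extract_archive(fname, dest=None):
    # Extract next to the archive by default, picking the extractor
    # from the file contents rather than the extension.
    fname = Path(fname)
    dest = Path(dest or fname.parent)
    if zipfile.is_zipfile(fname):
        with zipfile.ZipFile(fname) as zf: zf.extractall(dest)
    elif tarfile.is_tarfile(fname):
        with tarfile.open(fname) as tf: tf.extractall(dest)
    else:
        raise ValueError(f"Unrecognized archive: {fname}")
```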
newest_folder

[source]

newest_folder(path)

Return newest folder on `path`.
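"Newest" here means most recently modified. A sketch of that behaviour using only pathlib (assuming ties are broken arbitrarily and plain files are ignored):

```python
from pathlib import Path

def newest_subfolder(path):
    # Return the most recently modified directory directly under path.
    folders = [p for p in Path(path).iterdir() if p.is_dir()]
    return max(folders, key=lambda p: p.stat().st_mtime, default=None)
```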
rename_extracted

[source]

rename_extracted(dest)

Rename file if different from `dest`.

This renames the extracted data when its folder name differs from `dest`.
untar_data

[source]

untar_data(url, fname=None, dest=None, c_key='data', force_download=False, extract_func=file_extract, timeout=4)

Download `url` to `fname` if `dest` doesn't exist, and un-tgz or unzip to folder `dest`.
`untar_data` is a very powerful convenience function to download files from `url` to `dest`. The `url` can be a default url from the `URLs` class or a custom url. If `dest` is not passed, files are downloaded to the `default_dest`, which defaults to ~/.fastai/data/.

This convenience function extracts the downloaded files to `dest` by default. To simply download the files without extracting them, pass the `noop` function as `extract_func`.
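Any callable with the same shape as `file_extract` will do; a do-nothing extractor looks like this (`skip_extract` is a hypothetical stand-in for fastai's `noop`):

```python
def skip_extract(fname, dest=None):
    # Accept the same arguments as file_extract but do nothing,
    # so untar_data only downloads the archive.
    pass

# Hypothetical usage (assumes fastai is installed):
# path = untar_data(URLs.MNIST_SAMPLE, extract_func=skip_extract)
```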
Note, it is also possible to pass a custom `extract_func` to `untar_data` if the filetype doesn't end with .tgz or .zip. The gzip and zip files are supported by default and there is no need to pass a custom `extract_func` for these types of files.
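A custom `extract_func` only needs to accept `(fname, dest)`. As an illustration, here is a hypothetical extractor for a single gzip-compressed file (a plain .gz that is not a tar archive), built on the standard library:

```python
import gzip, shutil
from pathlib import Path

def gunzip_extract(fname, dest=None):
    # Decompress a single .gz file into dest (or next to the source),
    # dropping the .gz suffix; returns the decompressed file's path.
    fname = Path(fname)
    out = Path(dest or fname.parent)/fname.stem
    with gzip.open(fname, 'rb') as fin, open(out, 'wb') as fout:
        shutil.copyfileobj(fin, fout)
    return out
```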
Internally, if the files are not already available at the `fname` location (which defaults to ~/.fastai/archive/), they get downloaded to ~/.fastai/archive and are then extracted at the `dest` location. If no `dest` is passed, the `default_dest` to download the files is ~/.fastai/data. If the files are already available at the `fname` location but not at `dest`, then a symbolic link is created for each file from the `fname` location to `dest`.

Also, if `force_download` is set to True, files are re-downloaded even if they exist.
```python
from tempfile import TemporaryDirectory

test_eq(untar_data(URLs.MNIST_SAMPLE), config.data/'mnist_sample')

with TemporaryDirectory() as d:
    d = Path(d)
    dest = untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)
    assert Path('mnist_tiny.tgz').exists()
    assert (d/'mnist_tiny').exists()
    os.unlink('mnist_tiny.tgz')

# Test c_key
tst_model = config.model/'mnist_sample'
test_eq(untar_data(URLs.MNIST_SAMPLE, c_key='model'), tst_model)
assert not tst_model.with_suffix('.tgz').exists()  # Archive wasn't downloaded in the models path
assert (config.archive/'mnist_sample.tgz').exists()  # Archive was downloaded there
shutil.rmtree(tst_model)
```
Sometimes the extracted folder does not have the same name as the downloaded file.
```python
with TemporaryDirectory() as d:
    d = Path(d)
    untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)
    Path('mnist_tiny.tgz').rename('nims_tini.tgz')
    p = Path('nims_tini.tgz')
    dest = Path('nims_tini')
    assert p.exists()
    file_extract(p, dest.parent)
    rename_extracted(dest)
    p.unlink()
    shutil.rmtree(dest)
```
©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021