widgets.image_cleaner
Image Cleaner Widget
fastai offers several widgets to support the workflow of a deep learning practitioner. The purpose of these widgets is to help you organize, clean, and prepare your data for your model. Widgets are separated by data type.
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learn = cnn_learner(data, models.resnet18, metrics=error_rate)
learn.fit_one_cycle(2)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.233059 | 0.115309 | 0.033857 | 00:10 |
1 | 0.110543 | 0.085249 | 0.028459 | 00:10 |
learn.save('stage-1')
We create a databunch with all the data in the training set and no validation set, since DatasetFormatter uses only the training set:
db = (ImageList.from_folder(path)
.split_none()
.label_from_folder()
.databunch())
learn = cnn_learner(db, models.resnet18, metrics=[accuracy])
learn.load('stage-1');
class DatasetFormatter [source][test]

DatasetFormatter()

Tests found for DatasetFormatter:

Some other tests where DatasetFormatter is used:

pytest -sv tests/test_widgets_image_cleaner.py::test_image_cleaner_with_data_from_csv [source]

To run tests please refer to this guide.

Returns a dataset with the appropriate format and file indices to be displayed.
The DatasetFormatter class prepares your image dataset for widgets by returning a formatted DatasetTfm based on the DatasetType specified. Use from_toplosses to grab the most problematic images directly from your learner. Optionally, you can restrict the formatted dataset returned to n_imgs.
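For example, a minimal sketch that restricts the formatted dataset to the 100 highest-loss training images (reusing the learn object fitted above):

# Keep only the 100 images with the highest loss
ds, idxs = DatasetFormatter().from_toplosses(learn, n_imgs=100)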
from_similars [source][test]

from_similars(learn, layer_ls:list=[0, 7, 2], **kwargs)

Tests found for from_similars:

Some other tests where from_similars is used:

pytest -sv tests/test_widgets_image_cleaner.py::test_image_cleaner_with_data_from_csv [source]

To run tests please refer to this guide.
Gets the indices for the most similar images.
from_toplosses [source][test]

from_toplosses(learn, n_imgs=None, **kwargs)

No tests found for from_toplosses. To contribute a test please refer to this guide and this discussion.
Gets indices with top losses.
from_most_unsure [source][test]

from_most_unsure(learn:Learner, num=50) → Tuple[DataLoader, List[int], Sequence[str], List[str]]

No tests found for from_most_unsure. To contribute a test please refer to this guide and this discussion.

Gets num items from the test set, for which the difference in probabilities between the most probable and second most probable classes is minimal.
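Conceptually, this ranking sorts items by the gap between their two highest class probabilities. A minimal sketch of the idea in plain PyTorch, using a hypothetical probs tensor (not fastai's actual implementation):

import torch

# Hypothetical softmax outputs for 1000 test items over 2 classes,
# e.g. probs, _ = learn.get_preds(ds_type=DatasetType.Test)
probs = torch.rand(1000, 2).softmax(dim=1)
top2, _ = probs.topk(2, dim=1)    # the two highest probabilities per item
gap = top2[:, 0] - top2[:, 1]     # a small gap means the model is unsure
most_unsure = gap.argsort()[:50]  # indices of the 50 most unsure items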
class ImageCleaner [source][test]

ImageCleaner(dataset:LabelLists, fns_idxs:Collection[int], path:PathOrStr, batch_size=5, duplicates=False) :: BasicImageWidget

Tests found for ImageCleaner:

pytest -sv tests/test_widgets_image_cleaner.py::test_image_cleaner_index_length_mismatch [source]
pytest -sv tests/test_widgets_image_cleaner.py::test_image_cleaner_length_correct [source]
pytest -sv tests/test_widgets_image_cleaner.py::test_image_cleaner_with_data_from_csv [source]
pytest -sv tests/test_widgets_image_cleaner.py::test_image_cleaner_wrong_input_type [source]

To run tests please refer to this guide.

Displays images for relabeling or deletion and saves changes in path as 'cleaned.csv'.
ImageCleaner is for cleaning up images that don't belong in your dataset. It renders images in a row and gives you the opportunity to delete the file from your file system. To use ImageCleaner we must first use DatasetFormatter().from_toplosses to get the suggested indices for misclassified images.
ds, idxs = DatasetFormatter().from_toplosses(learn)
ImageCleaner(ds, idxs, path)
<fastai.widgets.image_cleaner.ImageCleaner at 0x7f45bd324240>
ImageCleaner does not change anything on disk (neither labels nor the existence of images). Instead, it creates a 'cleaned.csv' file in your data path, from which you need to load a new databunch for the changes to be applied.
df = pd.read_csv(path/'cleaned.csv', header='infer')
# We create a databunch from our csv. We include the data in the training set and we don't use a validation set (DatasetFormatter uses only the training set)
np.random.seed(42)
db = (ImageList.from_df(df, path)
.split_none()
.label_from_df()
.databunch(bs=64))
learn = cnn_learner(db, models.resnet18, metrics=error_rate)
learn = learn.load('stage-1')
You can then use ImageCleaner again to find duplicates in the dataset. To do this, you can specify duplicates=True while calling ImageCleaner after getting the indices and dataset from .from_similars. Note that if you are using a layer's output which has dimensions (n_batches, n_features, 1, 1) then you don't need any pooling (this is the case with the last layer). The suggested use of .from_similars() with resnets is using the last layer and no pooling, like in the following cell.
ds, idxs = DatasetFormatter().from_similars(learn, layer_ls=[0,7,1], pool=None)
Getting activations...
100.00% [226/226 00:02<00:00]
Computing similarities...
ImageCleaner(ds, idxs, path, duplicates=True)
<fastai.widgets.image_cleaner.ImageCleaner at 0x7f236f7214a8>
class PredictionsCorrector [source][test]

PredictionsCorrector(dataset:LabelLists, fns_idxs:Collection[int], classes:Sequence[str], labels:Sequence[str], batch_size:int=5) :: BasicImageWidget

No tests found for PredictionsCorrector. To contribute a test please refer to this guide and this discussion.

Displays images for manual inspection and relabelling.
In competitions, you need to provide predictions for the test set, for which true labels are unknown. You can slightly improve your score by manually correcting the most egregious misclassifications. The test set in a competition is usually too large to inspect in its entirety, so a good subset to look at is the images the model is least sure about, i.e. those where the difference in probabilities between the most probable and the second most probable classes is minimal.
Let’s start by creating an artificial test set and training a model:
db_test = ImageDataBunch.from_folder(path/'train', valid_pct = 0.2, test = '../valid')
learn = cnn_learner(db_test, models.resnet18, metrics=[accuracy])
learn.fit_one_cycle(2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.245369 | 0.138495 | 0.953610 | 02:14 |
1 | 0.133183 | 0.103801 | 0.963695 | 02:16 |
Now we can take a look at the most unsure images:
most_unsure = DatasetFormatter.from_most_unsure(learn)
wgt = PredictionsCorrector(*most_unsure)
'No images to show :)'
show_corrections [source][test]

show_corrections(ncols:int, **fig_kw)

No tests found for show_corrections. To contribute a test please refer to this guide and this discussion.
Shows a grid of images whose predictions have been corrected.
When you have finished interacting with the widget, you can take a look at the corrections that were made. show_corrections displays each image's index in the test set, its previous label, and its new label.
wgt.show_corrections(ncols=6, figsize=(9, 7))
corrected_labels [source][test]

corrected_labels() → List[str]

No tests found for corrected_labels. To contribute a test please refer to this guide and this discussion.

Returns labels for the entire test set with corrections applied.
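A typical use is building a competition submission from the corrected labels. A minimal sketch, where the file name and column layout are assumptions to adapt to your competition:

import pandas as pd

labels = wgt.corrected_labels()  # one label per test item, corrections applied
sub = pd.DataFrame({'id': range(len(labels)), 'label': labels})
sub.to_csv('submission.csv', index=False)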
class ImageDownloader [source][test]

ImageDownloader(path:PathOrStr='data')

Tests found for ImageDownloader:

pytest -sv tests/test_widgets_image_cleaner.py::test_image_downloader_with_path [source]

To run tests please refer to this guide.
Displays a widget that allows searching and downloading images from Google Images in a Jupyter Notebook or Lab.
The ImageDownloader widget gives you a way to quickly bootstrap your image dataset without leaving the notebook. It searches for and downloads images that match the search criteria and resolution / quality requirements, and stores them on your filesystem within the provided path.

Images for each search query (or label) are stored in a separate folder within path. For example, if you populate tiger with path set to ./data, you'll get a folder ./data/tiger/ with the tiger images in it.

ImageDownloader will automatically clean up and verify the downloaded images with verify_images() after downloading them.
path = Config.data_path()/'image_downloader'
os.makedirs(path, exist_ok=True)
ImageDownloader(path)
<fastai.widgets.image_downloader.ImageDownloader at 0x7f236c056358>
Downloading images in Python scripts outside Jupyter notebooks
path = Config.data_path()/'image_downloader'
files = download_google_images(path, 'aussie shepherd', size='>1024*768', n_images=30)
len(files)
100.00% [30/30 00:00<00:00]
100.00% [30/30 00:00<00:00]
30
download_google_images [source][test]

download_google_images(path:PathOrStr, search_term:str, size:str='>400*300', n_images:int=10, format:str='jpg', max_workers:int=8, timeout:int=4) → FilePathList

No tests found for download_google_images. To contribute a test please refer to this guide and this discussion.

Search for n_images images on Google, matching search_term and size requirements, download them into path/search_term and verify them, using max_workers threads.
After populating images with ImageDownloader, you can get an ImageDataBunch by calling ImageDataBunch.from_folder(path, size=size), or using the data block API.
# Setup path and labels to search for
path = Config.data_path()/'image_downloader'
labels = ['boston terrier', 'french bulldog']
# Download images
for label in labels:
download_google_images(path, label, size='>400*300', n_images=50)
# Build a databunch and train!
src = (ImageList.from_folder(path)
.split_by_rand_pct()
.label_from_folder()
.transform(get_transforms(), size=224))
db = src.databunch(bs=16, num_workers=0)
learn = cnn_learner(db, models.resnet34, metrics=[accuracy])
learn.fit_one_cycle(3)
Downloading more than a hundred images
To fetch more than a hundred images, ImageDownloader uses selenium and chromedriver to scroll through the Google Images search results page and scrape image URLs. They're not required as dependencies by default. If you don't have them installed on your system, the widget will show you an error message.
To install selenium, just pip install selenium in your fastai environment.

On a Mac, you can install chromedriver with brew cask install chromedriver.

On Ubuntu, check the latest chromedriver version available, then run something like:

wget https://chromedriver.storage.googleapis.com/2.45/chromedriver_linux64.zip
unzip chromedriver_linux64.zip

and make sure the unzipped chromedriver binary is somewhere on your PATH.
Note that downloading under 100 images doesn't require any dependencies other than fastai itself; downloading more than a hundred images, however, uses selenium and chromedriver.
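If you want a script to fail gracefully before requesting a large download, here is a minimal sketch of a guard (an assumption about how you might structure your own script, not part of the fastai API):

try:
    import selenium  # only needed when downloading more than 100 images
    print('selenium is available; large downloads should work')
except ImportError:
    print('selenium missing: pip install selenium (and install chromedriver)')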
size can be one of the following (see the example after the list):
'>400*300'
'>640*480'
'>800*600'
'>1024*768'
'>2MP'
'>4MP'
'>6MP'
'>8MP'
'>10MP'
'>12MP'
'>15MP'
'>20MP'
'>40MP'
'>70MP'
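For example, to request only images larger than two megapixels (reusing download_google_images as documented above; the search term is just an illustration):

files = download_google_images(path, 'tiger', size='>2MP', n_images=20)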