Joblib
Many Scikit-Learn algorithms are written for parallel execution using Joblib, which natively provides thread-based and process-based parallelism. Joblib is what backs the n_jobs= parameter in normal use of Scikit-Learn.
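For reference, this is how n_jobs= is typically used on a single machine in plain Scikit-Learn; a minimal sketch, where the estimator and dataset are only illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()

# n_jobs=-1 asks Joblib to use all available cores on this machine
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(digits.data, digits.target)
```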
Dask can scale these Joblib-backed algorithms out to a cluster of machines by providing an alternative Joblib backend.
To use the Dask backend to Joblib you have to create a Client and wrap your code with joblib.parallel_backend('dask'):
```python
from dask.distributed import Client
import joblib

client = Client(processes=False)             # create local cluster
# client = Client("scheduler-address:8786")  # or connect to remote cluster

with joblib.parallel_backend('dask'):
    # Your scikit-learn code
    ...
```
As an example, you might distribute a randomized cross-validated parameter search as follows:
```python
import numpy as np
from dask.distributed import Client
import joblib

from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

client = Client(processes=False)  # create local cluster

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
```
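Once the search finishes, you can inspect the results with the standard RandomizedSearchCV attributes:

```python
# Best hyperparameters and the corresponding cross-validated score
print(search.best_params_)
print(search.best_score_)
```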
Note that the Dask Joblib backend is useful for scaling out CPU-bound workloads: workloads with datasets that fit in RAM, but with many individual operations that can be done in parallel. To scale out to RAM-bound workloads (larger-than-memory datasets), use one of the following alternatives (see the sketch after this list):
- Parallel Meta-estimators
- Incremental Hyperparameter Optimization
- or one of the estimators from the API Reference
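For instance, a larger-than-memory workload might pair a Dask array with the Incremental meta-estimator from dask-ml, which feeds each chunk to the wrapped estimator's partial_fit. A minimal sketch, assuming dask-ml is installed and using a synthetic Dask array purely for illustration:

```python
import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

# Synthetic chunked data standing in for a larger-than-memory dataset
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (da.random.random(100_000, chunks=(10_000,)) > 0.5).astype(int)

# Incremental calls SGDClassifier.partial_fit on one chunk at a time,
# so the full dataset never has to fit in memory at once
inc = Incremental(SGDClassifier(), scoring='accuracy')
inc.fit(X, y, classes=[0, 1])
```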