Setup
This page describes various ways to set up Dask on different hardware, either locally on your own machine or on a distributed cluster. If you are just getting started, then this page is unnecessary. Dask does not require any setup if you only want to use it on a single computer.
Dask has two families of task schedulers:
- Single machine scheduler: This scheduler provides basic features on a local process or thread pool. This scheduler was made first and is the default. It is simple and cheap to use. It can only be used on a single machine and does not scale.
- Distributed scheduler: This scheduler is more sophisticated. It offers more features, but also requires a bit more effort to set up. It can run locally or distributed across a cluster.

If you import Dask, set up a computation, and then call `compute`, then you will use the single-machine scheduler by default. To use the `dask.distributed` scheduler you must set up a `Client`:
```python
import dask.dataframe as dd

df = dd.read_csv(...)
df.x.sum().compute()  # This uses the single-machine scheduler by default

from dask.distributed import Client

client = Client(...)  # Connect to distributed cluster and override default
df.x.sum().compute()  # This now runs on the distributed system
```
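You can also choose a single-machine scheduler explicitly. The `scheduler=` keyword to `compute` and `dask.config.set` are both part of the Dask API; the sketch below assumes a Dask DataFrame named `df` like the one above:

```python
import dask

df.x.sum().compute(scheduler="processes")  # use the local process pool for this one call

with dask.config.set(scheduler="threads"):  # use the local thread pool within this block
    df.x.sum().compute()
```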
Note that the newer `dask.distributed` scheduler is often preferable, even on single workstations. It contains many diagnostics and features not found in the older single-machine scheduler.
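For example (a minimal sketch, assuming the `distributed` package is installed), creating a `Client` with no arguments starts a local cluster on your own machine and serves a diagnostic dashboard:

```python
from dask.distributed import Client

client = Client()  # with no arguments, starts a local cluster of workers
print(client.dashboard_link)  # URL of the diagnostic dashboard for this cluster
```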
The following pages explain in more detail how to set up Dask on a variety of local and distributed hardware.
- Single Machine:
  - Default Scheduler: The no-setup default. Uses local threads or processes for larger-than-memory processing.
  - dask.distributed: The sophistication of the newer system on a single machine. This provides more advanced features while still requiring almost no setup.
- Distributed computing:
  - Manual Setup: The command line interface to set up `dask-scheduler` and `dask-worker` processes. Useful for IT or anyone building a deployment solution (a command-line sketch appears after this list).
  - SSH: Use SSH to set up Dask across an unmanaged cluster.
  - High Performance Computers: How to run Dask on traditional HPC environments using tools like MPI, or job schedulers like SLURM, SGE, TORQUE, LSF, and so on (a dask-jobqueue sketch appears after this list).
  - Kubernetes: Deploy Dask with the popular Kubernetes resource manager using either Helm or a native deployment.
  - YARN / Hadoop: Deploy Dask on YARN clusters, such as are found in traditional Hadoop installations.
  - Python API (advanced): Create `Scheduler` and `Worker` objects from Python as part of a distributed Tornado TCP application. This page is useful for those building custom frameworks (a sketch appears after this list).
  - Docker containers are available and may be useful in some of the solutions above.
  - Cloud: Current recommendations on how to deploy Dask and Jupyter on common cloud providers like Amazon, Google, or Microsoft Azure.
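As referenced in the Manual Setup item above, a minimal command-line sketch looks like the following. The host address is purely illustrative; the scheduler listens on port 8786 by default:

```
# On the machine that will run the scheduler:
dask-scheduler

# On each machine that will run a worker, pointing at the scheduler's
# address (tcp://10.0.0.1:8786 below is an illustration):
dask-worker tcp://10.0.0.1:8786
```

A `Client` can then connect to the running scheduler from Python with `Client("tcp://10.0.0.1:8786")`.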
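For HPC job schedulers, the separate dask-jobqueue package is one common route. This is a hedged sketch: the queue name, core count, and memory values below are placeholders, not recommendations:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # separate package: dask-jobqueue

# Each Dask worker is submitted to SLURM as a job with these resources
cluster = SLURMCluster(queue="regular", cores=24, memory="100GB")
cluster.scale(jobs=10)  # ask SLURM for ten worker jobs

client = Client(cluster)  # computations now run on the HPC allocation
```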
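And a minimal sketch of the advanced Python API, following the `Scheduler`/`Worker` pattern from the distributed documentation (this assumes an async context, as these objects are Tornado-based):

```python
import asyncio
from dask.distributed import Client, Scheduler, Worker

async def main():
    # Start a scheduler and one worker in this process, then connect a client
    async with Scheduler() as scheduler:
        async with Worker(scheduler.address) as worker:
            async with Client(scheduler.address, asynchronous=True) as client:
                future = client.submit(lambda x: x + 1, 10)
                print(await future)  # 11

asyncio.run(main())
```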