Limitations
Dask.distributed has limitations. Understanding these can help you reliably create efficient distributed computations.
Performance
- The central scheduler spends a few hundred microseconds on every task. For optimal performance, task durations should be greater than 10-100ms.
- Dask cannot parallelize within individual tasks. Individual tasks should be a comfortable size so as not to overwhelm any particular worker.
- Dask assigns tasks to workers heuristically. It usually makes the right decision, but non-optimal situations do occur.
- The workers are just Python processes, and inherit all capabilities and limitations of Python. They do not bound or limit themselves in any way. In production you may wish to run dask-workers within containers.
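Because the scheduler spends a few hundred microseconds per task, very small units of work should be batched so each task does at least 10-100ms of computation. The following sketch shows that pattern in plain Python; the helper names (`process_item`, `process_batch`, `make_batches`) are illustrative, not part of the Dask API:

```python
def process_item(x):
    # Stand-in for a very cheap per-item computation (hypothetical).
    return x * x

def process_batch(batch):
    # One task handles many items, amortizing the per-task scheduler overhead.
    return [process_item(x) for x in batch]

def make_batches(items, batch_size):
    # Slice a flat list into chunks of at most batch_size items.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

items = list(range(10_000))
batches = make_batches(items, 1_000)  # 10 tasks instead of 10,000
results = [r for batch in batches for r in process_batch(batch)]
```

With Dask you would submit `process_batch` (rather than `process_item`) as the task, so each task's duration comfortably exceeds the scheduler's per-task cost.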
Assumptions on Functions and Data
Dask assumes the following about your functions and your data:
- All functions must be serializable either with pickle or cloudpickle. This is usually the case except in fairly exotic situations. The following should work:

      from cloudpickle import dumps, loads
      loads(dumps(my_object))
- All data must be serializable either with pickle, cloudpickle, or using Dask's custom serialization system.
- Dask may run your functions multiple times, such as if a worker holding an intermediate result dies. Any side effects should be idempotent.
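The `loads(dumps(my_object))` round trip above is easy to turn into a quick check before submitting work to a cluster. This sketch uses the standard-library `pickle` (cloudpickle exposes the same `dumps`/`loads` interface, and additionally handles objects such as lambdas and interactively defined functions that plain pickle cannot); the helper name `is_serializable` is illustrative:

```python
import pickle

def is_serializable(obj):
    # Round-trip the object: if dumps or loads raises, Dask workers
    # would likewise fail to receive it over the wire.
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

print(is_serializable({"a": [1, 2, 3]}))  # True: plain data pickles fine
print(is_serializable(lambda x: x + 1))   # False with plain pickle; cloudpickle handles lambdas
```

Running the same check with `cloudpickle.dumps`/`cloudpickle.loads` covers the cases Dask actually relies on.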
Security
As a distributed computing framework, Dask enables the remote execution of arbitrary code. You should only host dask-workers within networks that you trust. This is standard among distributed computing frameworks, but is worth repeating.