Why Dask?
This document gives high-level motivation on why people choose to adopt Dask.
- Python’s role in Data Science
- Dask has a Familiar API
- Dask Scales out to Clusters
- Dask Scales Down to Single Computers
- Dask Integrates Natively with Python Code
- Dask Supports Complex Applications
- Dask Delivers Responsive Feedback
- Links and More Information
Python’s role in Data Science
Python has grown to become the dominant language both in data analytics and general programming. This is fueled both by computational libraries like Numpy, Pandas, and Scikit-Learn and by a wealth of libraries for visualization, interactive notebooks, collaboration, and so forth. However, these packages were not designed to scale beyond a single machine. Dask was developed to scale these packages and the surrounding ecosystem. It works with the existing Python ecosystem to scale it to multi-core machines and distributed clusters.
Image credit to Stack Overflow blog posts #1 and #2.
Dask has a Familiar API
Analysts often use tools like Pandas, Scikit-Learn, Numpy, and the rest of the Python ecosystem to analyze data on their personal computer. They like these tools because they are efficient, intuitive, and widely trusted. However, when they choose to apply their analyses to larger datasets, they find that these tools were not designed to scale beyond a single machine. And so, the analyst rewrites their computation using a more scalable tool, often in another language altogether. This rewrite process slows down discovery and causes frustration.
Dask provides ways to scale Pandas, Scikit-Learn, and Numpy workflows more natively, with minimal rewriting. It integrates well with these tools, copying most of their API and using their data structures internally. Moreover, Dask is co-developed with these libraries to ensure that they evolve consistently, minimizing friction when transitioning from a local laptop, to a multi-core workstation, and then to a distributed cluster. Analysts familiar with Pandas/Scikit-Learn/Numpy will be immediately familiar with their Dask equivalents, and have much of their intuition carry over to a scalable context.
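As a rough illustration (the file path and column names below are hypothetical), a typical Pandas workflow often translates to Dask by swapping the import and adding a final call to compute():

```python
import pandas as pd
import dask.dataframe as dd

# Pandas: loads the entire file into memory at once
df = pd.read_csv("transactions.csv")
result = df.groupby("account").amount.mean()

# Dask: same API, but the file is read in partitions and the
# groupby runs lazily, in parallel, when compute() is called
ddf = dd.read_csv("transactions.csv")
result = ddf.groupby("account").amount.mean().compute()
```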
Dask Scales out to Clusters
As datasets and computations scale faster than CPUs and RAM, we need to find ways to scale our computations across multiple machines. This introduces many new concerns:
- How to have computers talk to each other over the network?
- How and when to move data between machines?
- How to recover from machine failures?
- How to deploy on an in-house cluster?
- How to deploy on the cloud?
- How to deploy on an HPC super-computer?
- How to provide an API to this system that users find intuitive?
- …
While it is possible to build these systems in-house (and indeed, many exist), many organizations increasingly depend on solutions developed within the open source community. These tend to be more robust, secure, and fully featured without needing to be tended by in-house staff.
Dask solves the problems above. It figures out how to break up large computations and route parts of them efficiently onto distributed hardware. Dask is routinely run on thousand-machine clusters to process hundreds of terabytes of data efficiently within secure environments.
Dask has utilities and documentation on how to deploy in-house, on the cloud, or on HPC super-computers. It supports encryption and authentication using TLS/SSL certificates. It is resilient and can handle the failure of worker nodes gracefully, and it is elastic, so it can take advantage of new nodes added on-the-fly. Dask includes several user APIs that are used and smoothed over by thousands of researchers across the globe working in different domains.
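As a minimal sketch (the scheduler address below is a placeholder), connecting to a running Dask cluster goes through the distributed client, after which work is scheduled onto the cluster's workers:

```python
from dask.distributed import Client

# Connect to an already-running scheduler; the address is a placeholder
client = Client("tcp://scheduler-address:8786")

# Work submitted through Dask collections or client.submit() now runs
# on the cluster's workers rather than on the local machine
future = client.submit(sum, range(1_000_000))
print(future.result())
```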
Dask Scales Down to Single Computers
But a massive cluster is not always the right choice
Today’s laptops and workstations are surprisingly powerful and, if used correctly, can handle datasets and computations for which we previously depended on clusters. A modern laptop has a multi-core CPU, 32GB of RAM, and flash-based hard drives that can stream through data several times faster than HDDs or SSDs of even a year or two ago.
As a result, Dask can empower analysts to manipulate 100GB+ datasets on their laptop or 1TB+ datasets on a workstation without bothering with the cluster at all. This can be preferable for the following reasons:
- They can use their local software environment, rather than being constrained by what is available on the cluster or having to manage Docker images.
- They can more easily work while in transit, at a coffee shop, or at home away from the corporate network
- Debugging errors and analyzing performance is simpler and more pleasant on a single machine
- Their iteration cycles can be faster
- Their computations may be more efficient because all of the data is local and doesn’t need to flow through the network or between separate processes

Dask can enable efficient parallel computations on single machines by leveraging their multi-core CPUs and streaming data efficiently from disk. It can run on a distributed cluster, but it doesn’t have to. Dask allows you to swap out the cluster for single-machine schedulers, which are surprisingly lightweight, require no setup, and can run entirely within the same process as the user’s session.
To avoid excess memory use, Dask is good at finding ways to evaluate computations with a low memory footprint when possible, by pulling in chunks of data from disk, doing the necessary processing, and throwing away intermediate values as quickly as possible. This lets analysts perform computations on moderately large datasets (100GB+) even on relatively low-power laptops. This requires no configuration and no setup, meaning that adding Dask to a single-machine computation adds very little cognitive overhead.
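As a small sketch of this single-machine mode (the array shape and chunk sizes here are arbitrary), Dask collections can run with the built-in threaded scheduler inside the current process, processing a larger-than-memory array chunk by chunk:

```python
import dask.array as da

# An array far larger than RAM is never materialized whole: it is
# split into chunks that are processed a few at a time
x = da.random.random((200_000, 50_000), chunks=(10_000, 5_000))

# Run with the threaded single-machine scheduler, entirely in-process,
# with no cluster and no configuration
total = x.mean().compute(scheduler="threads")
print(total)
```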
Dask is installed by default with Anaconda and so is already deployed on most data science machines.
Dask Integrates Natively with Python Code
Python includes computational libraries like Numpy, Pandas, and Scikit-Learn, and many others for data access, plotting, statistics, image and signal processing, and more. These libraries work together seamlessly to produce a cohesive ecosystem of packages that co-evolve to meet the needs of analysts in most domains today.
This ecosystem is tied together by common standards and protocols to which everyone adheres, which allows these packages to benefit each other in surprising and delightful ways.
Dask evolved from within this ecosystem. It abides by these standards and protocols and actively engages in community efforts to push forward new ones. This enables the rest of the ecosystem to benefit from parallel and distributed computing with minimal coordination. Dask does not seek to disrupt or displace the existing ecosystem, but rather to complement and benefit it from within.
As a result, Dask development is pushed forward by developer communities from Pandas, Numpy, Scikit-Learn, Scikit-Image, Jupyter, and others. This engagement from the broader community helps users to trust the project and helps to ensure that the Python ecosystem will continue to evolve in a smooth and sustainable manner.
Dask Supports Complex Applications
Some parallel computations are simple and just apply the same routine onto many inputs without any kind of coordination. These are simple to parallelize with any system.
Somewhat more complex computations can be expressed with the map-shuffle-reduce pattern popularized by Hadoop and Spark. This is often sufficient to do most data cleaning tasks, database-style queries, and some lightweight machine learning algorithms.
However, more complex parallel computations exist which do not fit into these paradigms, and so are difficult to perform with traditional big-data technologies. These include more advanced algorithms for statistics or machine learning, time series or local operations, or bespoke parallelism often found within the systems of large enterprises.
Many companies and institutions today have problems which are clearly parallelizable, but not clearly transformable into a big DataFrame computation. Today these companies tend to solve their problems either by writing custom code with low-level systems like MPI, ZeroMQ, or sockets and complex queuing systems, or by shoving their problem into a standard big-data technology like MapReduce or Spark, and hoping for the best.
Dask helps to resolve these situations by exposing low-level APIs to its internal task scheduler, which is capable of executing very advanced computations. This gives engineers within the institution the ability to build their own parallel computing system using the same engine that powers Dask’s arrays, DataFrames, and machine learning algorithms, but now with the institution’s own custom logic. This allows engineers to keep complex business logic in-house while still relying on Dask to handle network communication, load balancing, resilience, diagnostics, etc.
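As a hedged sketch of one such low-level API, dask.delayed wraps ordinary Python functions into a lazy task graph; the functions below are placeholders standing in for institution-specific logic:

```python
import dask

# These functions stand in for arbitrary, custom business logic
@dask.delayed
def load(name):
    return list(range(10))

@dask.delayed
def clean(data):
    return [x for x in data if x % 2 == 0]

@dask.delayed
def combine(parts):
    return sum(sum(p) for p in parts)

# Compose the functions into a task graph; nothing runs yet
parts = [clean(load(name)) for name in ["a", "b", "c"]]

# Execute the whole graph in parallel with Dask's scheduler
result = combine(parts).compute()
print(result)
```

The same graph runs unchanged on a single-machine scheduler or on a distributed cluster.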
Dask Delivers Responsive Feedback
Because everything happens remotely, interactive parallel computing can be frustrating for users. They don’t have a good sense of how computations are progressing, what might be going wrong, or which parts of their code they should focus on for performance. The added distance between a user and their computation can drastically affect how quickly they are able to identify and resolve bugs and performance problems, which can greatly increase their time to solution.
Dask keeps users informed and content with a suite of helpful diagnostic and investigative tools, including the following:
- A real-time and responsive dashboard that shows current progress, communication costs, memory use, and more, updated every 100ms (a minimal sketch of opening the dashboard follows this list)
- A statistical profiler installed on every worker that polls each thread every 10ms to determine which lines in your code are taking up the most time across your entire computation
- An embedded IPython kernel in every worker and the scheduler, allowing users to directly investigate the state of their computation with a pop-up terminal
- The ability to reraise errors locally, so that they can use the traditional debugging tools to which they are accustomed, even when the error happens remotely
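As a minimal sketch, starting a client with no arguments launches a local cluster along with its diagnostic dashboard, whose address is exposed on the client:

```python
from dask.distributed import Client

# With no arguments, Client() starts a local cluster and its
# diagnostic dashboard alongside it
client = Client()

# Open this link in a browser to watch progress, memory use, and
# task timings update in real time
print(client.dashboard_link)
```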
Links and More Information
From here you may want to read about some of our more common introductory content: