Bag
Dask Bag implements operations like map, filter, fold, and groupby on collections of generic Python objects. It does this in parallel with a small memory footprint using Python iterators. It is similar to a parallel version of PyToolz or a Pythonic version of the PySpark RDD.
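As a quick illustration of what those operations look like in practice, here is a small, contrived sketch using an in-memory sequence (real workloads usually start from files):

```python
import dask.bag as db

# A contrived in-memory bag; the data and functions are made up for illustration
b = db.from_sequence(range(10), npartitions=2)

squares = b.map(lambda x: x ** 2)             # apply a function to every element
evens = squares.filter(lambda x: x % 2 == 0)  # keep only matching elements
total = evens.sum()                           # reduce across all partitions

print(total.compute())  # 120
```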
Examples
Visit https://examples.dask.org/bag.html to see and run examples using Dask Bag.
Design
Dask bags coordinate many Python lists or iterators, each of which forms a partition of a larger collection.
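For example (a minimal sketch with a made-up sequence), the number of partitions is chosen when the bag is created, and each partition is consumed lazily at compute time:

```python
import dask.bag as db

# 100 items split across 4 partitions; each partition is iterated lazily
b = db.from_sequence(range(100), npartitions=4)

print(b.npartitions)                   # 4
print(b.map(lambda x: x + 1).take(3))  # (1, 2, 3), pulled from the first partition only
```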
Common Uses
Dask bags are often used to parallelize simple computations on unstructured or semi-structured data like text data, log files, JSON records, or user-defined Python objects.
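A typical sketch of this kind of workload, assuming hypothetical newline-delimited JSON log files with made-up field names ('status', 'user'):

```python
import json
import dask.bag as db

# Hypothetical log files; the path pattern and field names are illustrative only
records = (db.read_text('logs/2024-*.json')
             .map(json.loads)                                # parse each line into a dict
             .filter(lambda r: r.get('status') == 'error'))  # keep only error records

# Count error records per user
counts = records.pluck('user').frequencies()
# counts.compute() -> e.g. [('alice', 10), ('bob', 3), ...]
```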
Execution
Execution on bags provides two benefits:
- Parallel: data is split up, allowing multiple cores or machines to execute in parallel
- Iterating: data is processed lazily, allowing smooth execution of larger-than-memory data, even on a single machine within a single partition
Default scheduler
By default, dask.bag uses dask.multiprocessing for computation. As a benefit, Dask bypasses the GIL and uses multiple cores on pure Python objects. As a drawback, Dask Bag doesn’t perform well on computations that include a great deal of inter-worker communication. For common operations this is rarely an issue, as most Dask Bag workflows are embarrassingly parallel or result in reductions with little data moving between workers.
Because the multiprocessing scheduler requires moving functions between multiple processes, we encourage Dask Bag users to also install the cloudpickle library to enable the transfer of more complex functions.
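If you want to choose the scheduler explicitly, a minimal sketch looks like the following (the scheduler names are standard Dask options; the workload itself is contrived):

```python
import dask
import dask.bag as db

b = db.from_sequence(range(1000), npartitions=8)
doubled_sum = b.map(lambda x: x * 2).sum()

# Request the process-based scheduler explicitly for a single call ...
print(doubled_sum.compute(scheduler="processes"))

# ... or set a scheduler contextually ("threads" and "synchronous"
# are other valid options, the latter being useful for debugging)
with dask.config.set(scheduler="processes"):
    print(doubled_sum.compute())
```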
Shuffle
Some operations, like groupby, require substantial inter-worker communication. On a single machine, Dask uses partd to perform efficient, parallel, spill-to-disk shuffles. When working in a cluster, Dask uses a task-based shuffle.
These shuffle operations are expensive and better handled by projects like dask.dataframe. It is best to use dask.bag to clean and process data, then transform it into an array or DataFrame before embarking on the more complex operations that require shuffle steps.
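A sketch of that hand-off, again using hypothetical JSON log files and made-up field names ('user', 'duration'):

```python
import json
import dask.bag as db

# Clean records with the bag API, then hand the shuffle-heavy
# aggregation to dask.dataframe (paths and fields are illustrative)
records = (db.read_text('logs/2024-*.json')
             .map(json.loads)
             .filter(lambda r: 'user' in r and 'duration' in r))

df = records.to_dataframe()                    # columns inferred from the dict keys
per_user = df.groupby('user').duration.mean()  # handled by the dataframe shuffle machinery
# per_user.compute()
```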
Known Limitations
Bags provide very general computation (any Python function). This generality comes at a cost. Bags have the following known limitations:
- By default, they rely on the multiprocessing scheduler, which has its own set of known limitations (see Shared Memory)
- Bags are immutable and so you cannot change individual elements
- Bag operations tend to be slower than array/DataFrame computations in the same way that standard Python containers tend to be slower than NumPy arrays and Pandas DataFrames
- Bag’s groupby is slow. You should try to use Bag’s foldby if possible. Using foldby requires more thought, though; see the sketch after this list
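As a rough comparison of the two approaches (made-up records; the reductions shown are only one way to write them):

```python
import dask.bag as db

b = db.from_sequence([{'name': 'Alice', 'amount': 1},
                      {'name': 'Bob',   'amount': 2},
                      {'name': 'Alice', 'amount': 3}], npartitions=2)

# groupby shuffles the data so that whole groups end up together (slow)
totals_slow = (b.groupby(lambda r: r['name'])
                .map(lambda kv: (kv[0], sum(r['amount'] for r in kv[1]))))

# foldby reduces within each partition, then combines the partial results,
# avoiding the shuffle -- at the cost of writing the reduction by hand
totals_fast = b.foldby(key=lambda r: r['name'],
                       binop=lambda total, r: total + r['amount'],
                       initial=0,
                       combine=lambda x, y: x + y,
                       combine_initial=0)

# Both compute to [('Alice', 4), ('Bob', 2)] (order not guaranteed)
```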
Name
Bag is the mathematical name for an unordered collection allowing repeats. It is a friendly synonym for multiset. A bag, or a multiset, is a generalization of the concept of a set that, unlike a set, allows multiple instances of the multiset’s elements:
- list: ordered collection with repeats, [1, 2, 3, 2]
- set: unordered collection without repeats, {1, 2, 3}
- bag: unordered collection with repeats, {1, 2, 2, 3}
So, a bag is like a list, but it doesn’t guarantee an ordering among elements. There can be repeated elements, but you can’t ask for the ith element.