Related Work
Writing the “related work” for a project called “distributed” is a Sisyphean task. We’ll list a few notable projects that you’ve probably already heard of down below.
You may also find the dask comparison with Spark of interest.
Big Data World
- The venerable Hadoop provides batch processing with the MapReduce programming paradigm. Python users typically use Hadoop Streaming or MRJob.
- Spark builds on top of HDFS systems with a nicer API and in-memory processing. Python users typically use PySpark.
- Storm provides streaming computation. Python users typically use streamparse.
This is a woefully inadequate representation of the excellent work blossoming in this space. A variety of projects have come into this space and rival or complement the projects above. Still, most “Big Data” processing hype probably centers around the three projects above, or their derivatives.
Python Projects
There are dozens of Python projects for distributed computing. Here we list a few of the more prominent projects that we see in active use today.
Task scheduling
- Celery: An asynchronous task scheduler, focusing on real-time processing.
- Luigi: A bulk big-data/batch task scheduler, with hooks to a variety of interesting data sources.
Ad hoc computation
- IPython Parallel: Allows for stateful remote control of several running ipython sessions.
- Scoop: Implements the concurrent.futures API on distributed workers. Notably allows tasks to spawn more tasks.
Direct Communication
- MPI4Py: Wraps the Message Passing Interface popular in high performance computing.
- PyZMQ: Wraps ZeroMQ, the high-performance asynchronous messaging library.
Venerable
There are a couple of older projects that often get mentioned
Relationship
In relation to these projects, distributed…
- Supports data-local computation like Hadoop and Spark
- Uses a task graph with data dependencies abstraction like Luigi
- Supports ad hoc applications, like IPython Parallel and Scoop
In depth comparison to particular projects
IPython Parallel
Short Description
IPython Parallel is a distributed computing framework from the IPython project. It uses a centralized hub to farm out jobs to several ipengine processes running on remote workers. It communicates over ZeroMQ sockets and centralizes communication through the central hub.
IPython Parallel has been around for a while and, while not particularly fancy, is quite stable and robust.
IPython Parallel offers parallel map and remote apply functions that route computations to remote workers:
    >>> from ipyparallel import Client
    >>> view = Client(...)[:]  # a view over all engines
    >>> results = view.map(func, sequence)
    >>> result = view.apply(func, *args, **kwargs)
    >>> future = view.apply_async(func, *args, **kwargs)
It also provides direct execution of code in the remote process and collection of data from the remote namespace.
    >>> view.execute('x = 1 + 2')
    >>> view['x']
    [3, 3, 3, 3, 3, 3]
Brief Comparison
Distributed and IPython Parallel are similar in that they provide map and apply/submit abstractions over distributed worker processes running Python. Both manage the remote namespaces of those worker processes.
They are dissimilar in terms of their maturity, how worker nodes communicate with each other, and the complexity of algorithms that they enable.
Distributed Advantages
The primary advantages of distributed over IPython Parallel include:
- Peer-to-peer communication between workers
- Dynamic task scheduling
Distributed workers share data in a peer-to-peer fashion, without having to send intermediate results through a central bottleneck. This allows distributed to be more effective for more complex algorithms and to manage larger datasets in a more natural manner. IPython Parallel does not provide a mechanism for workers to communicate with each other, except by using the central node as an intermediary for data transfer or by relying on some other medium, like a shared file system. Data transfer through the central node can easily become a bottleneck, so IPython Parallel has been mostly helpful in embarrassingly parallel work (the bulk of applications) but has not been used extensively for more sophisticated algorithms that require non-trivial communication patterns.
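A back-of-the-envelope model can make the bottleneck argument concrete. The sketch below is a toy calculation, not a measurement of either library: it assumes one equally sized chunk per worker, a pairwise tree reduction in which each combine step moves one operand to another worker, and intermediate results the same size as the inputs. The function names are hypothetical.

```python
def tree_reduction_moves(n_chunks):
    """Count operand moves in a pairwise tree reduction (toy model)."""
    moves = 0
    remaining = n_chunks
    while remaining > 1:
        pairs = remaining // 2   # each pair combines into one result
        moves += pairs           # one operand per pair must travel
        remaining -= pairs
    return moves


def bytes_through_hub(n_chunks, chunk_bytes, peer_to_peer):
    """Bytes the central node must handle under the assumptions above."""
    if peer_to_peer:
        # Workers exchange data directly; the hub only coordinates.
        return 0
    # Each move is worker -> hub -> worker, so the hub handles it twice.
    return 2 * tree_reduction_moves(n_chunks) * chunk_bytes


hub_routed = bytes_through_hub(8, 100_000_000, peer_to_peer=False)
direct = bytes_through_hub(8, 100_000_000, peer_to_peer=True)
```

Under these assumptions, eight 100 MB chunks put 1.4 GB of traffic on the central node when it acts as the intermediary, and none when workers transfer data directly.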
The distributed client includes a dynamic task scheduler capable of managing deep data dependencies between tasks. The IPython Parallel docs include a recipe for executing task graphs with data dependencies. This same idea is core to all of distributed, which uses a dynamic task scheduler for all operations. Notably, distributed.Future objects can be used within submit/map/get calls before they have completed.
    >>> x = client.submit(f, 1)  # returns a future
    >>> y = client.submit(f, 2)  # returns a future
    >>> z = client.submit(add, x, y)  # consumes futures
The ability to use futures cheaply within submit and map methods enables the construction of very sophisticated data pipelines with simple code. Additionally, distributed can serve as a full dask task scheduler, enabling support for distributed arrays, dataframes, machine learning pipelines, and any other application built on dask graphs. The dynamic task schedulers within distributed are adapted from the dask task schedulers and so are fairly sophisticated and efficient.
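For readers unfamiliar with dask graphs, a toy example may help. The sketch below assumes the dask convention of a dict that maps keys to either literal values or task tuples of the form (function, arg1, arg2, ...). The `toy_get` helper is hypothetical: it runs everything serially in one process and only illustrates how data dependencies are resolved, not how the distributed scheduler works.

```python
from operator import add


def inc(x):
    return x + 1


# A dask-style graph: keys map to literal values or (function, *args) tasks.
# Arguments that are themselves keys of the graph are data dependencies.
graph = {
    "a": 1,
    "x": (inc, "a"),
    "y": (inc, "x"),
    "z": (add, "x", "y"),
}


def toy_get(dsk, key, cache=None):
    """Hypothetical serial executor: compute `key`, dependencies first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Replace arguments that name other keys with their computed values.
        resolved = [
            toy_get(dsk, arg, cache) if isinstance(arg, str) and arg in dsk else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = task  # a literal value, nothing to compute
    cache[key] = result
    return result


z = toy_get(graph, "z")
```

The cache ensures that the shared dependency "x" is computed once even though both "y" and "z" need it. A dynamic scheduler works from the same kind of dependency information, but decides at run time where and when each task executes.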
IPython Parallel Advantages
IPython Parallel has the following advantages over distributed:
- Maturity: IPython Parallel has been around for a while.
- Explicit control over the worker processes: IPython Parallel allows you to execute arbitrary statements on the workers, allowing it to serve in system administration tasks.
- Deployment help: IPython Parallel has mechanisms built in to aid deployment on SGE, MPI, etc. Distributed does not have any such sugar, though it is fairly simple to set up by hand.
- Various other advantages: Over the years IPython Parallel has accrued a variety of helpful features like IPython interaction magics, @parallel decorators, etc.
concurrent.futures
The distributed.Client API is modeled after concurrent.futures and PEP 3148. It has a few notable differences:
- distributed accepts Future objects within calls to submit/map. When chaining computations, it is preferable to submit Future objects directly rather than wait on them before submission.
- The map() method returns Future objects, not concrete results. The map() method returns immediately.
- Despite sharing a similar API, distributed Future objects cannot always be substituted for concurrent.futures.Future objects, especially when using wait() or as_completed().
- Distributed generally does not support callbacks.
If you need full compatibility with the concurrent.futures.Executor API, use the object returned by the get_executor() method.
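For contrast, here is how the standard library side of these differences behaves. This sketch uses only concurrent.futures with a thread pool; the `inc` and `add` helpers are illustrative. It shows that the standard library requires results to be retrieved before chaining, and that Executor.map() yields concrete results rather than futures.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, wait


def inc(x):
    return x + 1


def add(x, y):
    return x + y


with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns a concurrent.futures.Future immediately.
    x = executor.submit(inc, 1)
    y = executor.submit(inc, 2)

    # The standard library does not unwrap futures passed as arguments,
    # so results must be retrieved before chaining.
    z = executor.submit(add, x.result(), y.result())

    # Executor.map() yields concrete results, not Future objects.
    mapped = list(executor.map(inc, range(3)))

    # wait() and as_completed() operate on concurrent.futures.Future objects.
    done, not_done = wait([x, y, z])
    completed = [f.result() for f in as_completed([x, y, z])]

total = z.result()
```

With distributed, the equivalent of the third submit call would pass x and y themselves, leaving the scheduler to resolve the dependency once both have finished.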