Launch Tasks from Tasks
Sometimes it is convenient to launch tasks from other tasks. For example, you may not know which computations to run until you have the results of some initial computations.
Motivating example
We want to download one piece of data and turn it into a list. Then we want to submit one task for every element of that list. We don’t know how long the list will be until we have the data.
So we send off our original download_and_convert_to_list function, which downloads the data and converts it to a list on one of our worker machines:
- future = client.submit(download_and_convert_to_list, uri)
But now we need to submit new tasks for individual parts of this data. We have three options.
- Gather the data back to the local process and then submit new jobs from the local process
- Gather only enough information about the data back to the local process and submit jobs from the local process
- Submit a task to the cluster that will submit other tasks directly from that worker
Gather the data locally
If the data is not large then we can bring it back to the client to perform the necessary logic on our local machine:
- >>> data = future.result() # gather data to local process
- >>> data # data is a list
- [...]
- >>> futures = client.map(process_element, data) # submit new tasks on data
- >>> analysis = client.submit(aggregate, futures) # submit final aggregation task
This is straightforward and, if data is small, it is probably the simplest and therefore the correct choice. However, if data is large then we have to choose another option.
Submit tasks from client
We can run small functions on our remote data to determine enough to submit the right kinds of tasks. In the following example we run the len function on the remote data and then break that data up into its individual elements.
- >>> n = client.submit(len, future) # compute number of elements remotely (future is the remote list)
- >>> n = n.result() # gather n (small) locally
- >>> from operator import getitem
- >>> elements = [client.submit(getitem, future, i) for i in range(n)] # split data
- >>> futures = client.map(process_element, elements)
- >>> analysis = client.submit(aggregate, futures)
We compute the length remotely, gather back this very small result, and then use it to submit more tasks to break up the data and process it on the cluster. This is more complex because we had to go back and forth a couple of times between the cluster and the local process, but the data moved was very small, and so this only added a few milliseconds to our total processing time.
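The steps above can be put together into a small end-to-end sketch. The three helper functions here are hypothetical stand-ins (the real download, processing, and aggregation logic is not shown in this document), and the in-process cluster created with `processes=False` is only an assumption made to keep the example self-contained.

```python
from operator import getitem

from dask.distributed import Client


def download_and_convert_to_list(uri):
    # Hypothetical stand-in: pretend the download produced five numbers.
    return list(range(5))


def process_element(x):
    # Hypothetical per-element work.
    return x * x


def aggregate(values):
    # Hypothetical final aggregation.
    return sum(values)


def run_from_client(uri="example://numbers"):
    # Assumption: a local, in-process cluster with the dashboard disabled.
    with Client(processes=False, dashboard_address=None) as client:
        future = client.submit(download_and_convert_to_list, uri)
        # Only the length (a single integer) travels back to the client.
        n = client.submit(len, future).result()
        # Split the remote list into one future per element.
        elements = [client.submit(getitem, future, i) for i in range(n)]
        futures = client.map(process_element, elements)
        analysis = client.submit(aggregate, futures)
        return analysis.result()


if __name__ == "__main__":
    print(run_from_client())  # expected: 30 (0 + 1 + 4 + 9 + 16)
```

The list itself never leaves the cluster; the client only sees the length and the final aggregate.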
Extended Example
Computing the Fibonacci numbers involves a recursive function. When the function is run, it calls itself using values it computed. We will use this as an example throughout this documentation to illustrate different techniques of submitting tasks from tasks.
- def fib(n):
-     if n < 2:
-         return n
-     a = fib(n - 1)
-     b = fib(n - 2)
-     return a + b
- print(fib(10)) # prints "55"
We will use this example to show the different interfaces.
Submit tasks from worker
Note: this interface is new and experimental. It may be changed without warning in future versions.
We can submit tasks from other tasks. This allows us to make decisions while on worker nodes.
To submit new tasks from a worker, that worker must first create a new client object that connects to the scheduler. There are three options for this:
- dask.delayed and dask.compute
- get_client with secede and rejoin
- worker_client
dask.delayed
Dask delayed behaves as normal: it submits the functions to the graph, optimizes for less bandwidth and computation, and gathers the results. For more detail, see dask.delayed.
- from distributed import Client
- from dask import delayed, compute
- @delayed
- def fib(n):
-     if n < 2:
-         return n
-     # We can use dask.delayed and dask.compute to launch
-     # computation from within tasks
-     a = fib(n - 1) # these calls are delayed
-     b = fib(n - 2)
-     a, b = compute(a, b) # execute both in parallel
-     return a + b
- if __name__ == "__main__":
-     # these features require the dask.distributed scheduler
-     client = Client()
-     result = fib(10).compute()
-     print(result) # prints "55"
Getting the client on a worker
The get_client function provides a normal Client object that gives full access to the dask cluster, including the ability to submit, scatter, and gather results.
- from distributed import Client, get_client, secede, rejoin
- def fib(n):
-     if n < 2:
-         return n
-     client = get_client()
-     a_future = client.submit(fib, n - 1)
-     b_future = client.submit(fib, n - 2)
-     a, b = client.gather([a_future, b_future])
-     return a + b
- if __name__ == "__main__":
-     client = Client()
-     future = client.submit(fib, 10)
-     result = future.result()
-     print(result) # prints "55"
However, this can deadlock the scheduler if too many tasks request jobs at once. A waiting task does not tell the scheduler that it is blocked on results and that its slot could be used for other tasks. This can deadlock the cluster if every scheduling slot is running a task and they all request more tasks.
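The slot starvation described above is not specific to Dask. As a rough analogy only, the sketch below uses Python's standard `concurrent.futures` thread pool rather than the Dask scheduler: a pool with a single slot cannot start a child task while the parent task is blocking that slot. The function names and the 0.5 second timeout are illustrative assumptions; without the timeout the parent would wait forever.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def child():
    return 1


def parent(pool):
    # The parent occupies the pool's only slot while it waits for the child.
    child_future = pool.submit(child)
    try:
        return child_future.result(timeout=0.5)
    except TimeoutError:
        # The child never started: no free slot was available to run it.
        return "starved"


def run_single_slot_pool():
    # Analogy, not Dask: one worker thread stands in for one scheduling slot.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(parent, pool).result()


if __name__ == "__main__":
    print(run_single_slot_pool())  # prints "starved"
```

Once the parent gives up and releases the slot, the queued child finally runs, which is the same effect that freeing a slot explicitly is meant to achieve.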
To avoid this deadlocking issue we can use secede and rejoin. These functions remove the current task from the cluster and rejoin it to the cluster, respectively.
- def fib(n):
-     if n < 2:
-         return n
-     client = get_client()
-     a_future = client.submit(fib, n - 1)
-     b_future = client.submit(fib, n - 2)
-     secede()
-     a, b = client.gather([a_future, b_future])
-     rejoin()
-     return a + b
Connection with context manager
The worker_client function performs the same task as get_client, but is implemented as a context manager. Using worker_client as a context manager ensures proper cleanup on the worker.
- from dask.distributed import Client, worker_client
- def fib(n):
-     if n < 2:
-         return n
-     with worker_client() as client:
-         a_future = client.submit(fib, n - 1)
-         b_future = client.submit(fib, n - 2)
-         a, b = client.gather([a_future, b_future])
-     return a + b
- if __name__ == "__main__":
-     client = Client()
-     future = client.submit(fib, 10)
-     result = future.result()
-     print(result) # prints "55"
Tasks that invoke worker_client are conservatively assumed to be long running. They can take a long time, waiting for other tasks to finish, gathering results, etc. In order to avoid having them take up processing slots, the following actions occur whenever a task invokes worker_client:
- The thread on the worker running this function secedes from the thread pool and goes off on its own. This allows the thread pool to populate that slot with a new thread and continue processing additional tasks without counting this long running task against its normal quota.
- The Worker sends a message back to the scheduler temporarily increasing its allowed number of tasks by one. This likewise lets the scheduler allocate more tasks to this worker, not counting this long running task against it.
Establishing a connection to the scheduler takes a few milliseconds, and so it is wise for computations that use this feature to be at least a few times longer in duration than this.
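Returning to the motivating example, the third option (submitting the per-element tasks directly from the worker) might look roughly like the sketch below. The helper functions are hypothetical stand-ins for the real download, processing, and aggregation steps, and the in-process cluster created with `processes=False` is an assumption made only to keep the example self-contained.

```python
from dask.distributed import Client, worker_client


def download_and_convert_to_list(uri):
    # Hypothetical stand-in: pretend the download produced five numbers.
    return list(range(5))


def process_element(x):
    # Hypothetical per-element work.
    return x * x


def aggregate(values):
    # Hypothetical final aggregation.
    return sum(values)


def download_process_aggregate(uri):
    # Runs on a worker: the list never has to travel back to the client.
    data = download_and_convert_to_list(uri)
    with worker_client() as client:
        # One task per element, submitted from inside this task.
        futures = client.map(process_element, data)
        results = client.gather(futures)
    return aggregate(results)


def run_from_worker(uri="example://numbers"):
    # Assumption: a local, in-process cluster with the dashboard disabled.
    with Client(processes=False, dashboard_address=None) as client:
        return client.submit(download_process_aggregate, uri).result()


if __name__ == "__main__":
    print(run_from_worker())  # expected: 30 (0 + 1 + 4 + 9 + 16)
```

Here the local process submits a single task and receives a single small result; the length of the list only ever needs to be known on the worker.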