8.3 Parallel Processing
Assume that we have a long-running command, such as the one shown in Example 8.1.
Example 8.1 (~/book/ch08/slow.sh)
#!/bin/bash
echo "Starting job $1"
duration=$((1+RANDOM%5))
sleep $duration
echo "Job $1 took ${duration} seconds"
- $RANDOM is an internal Bash variable that returns a pseudorandom integer between 0 and 32767. Taking the remainder of dividing that number by 5 and adding 1 ensures that the result is between 1 and 5.

This process does not take up all the resources we have available. And it so happens that we need to run this command many times. For example, we need to download a whole sequence of files.
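A quick sanity check of that arithmetic in plain Bash: drawing many samples of the expression should never fall outside 1 through 5.

```shell
# Draw 1000 samples of $((1+RANDOM%5)) and track the observed range.
# RANDOM%5 yields 0 through 4, so adding 1 yields 1 through 5.
min=5
max=1
for _ in $(seq 1000); do
  n=$((1+RANDOM%5))
  if [ "$n" -lt "$min" ]; then min=$n; fi
  if [ "$n" -gt "$max" ]; then max=$n; fi
done
echo "min=$min max=$max"   # with 1000 samples, almost surely: min=1 max=5
```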
A naive way to parallelize is to run the commands in the background:
$ cd ~/book/ch08
$ for i in {1..4}; do
> (slow.sh $i; echo Processed $i) &
> done
[1] 3334
[2] 3335
[3] 3336
[4] 3338
$ Starting job 2
Starting job 1
Starting job 3
Starting job 4
Job 4 took 1 seconds
Processed 4
Job 3 took 4 seconds
Job 2 took 4 seconds
Processed 3
Processed 2
Job 1 took 4 seconds
Processed 1
- Parentheses create a subshell. The ampersand ensures that the subshell is executed in the background.

The problem with this approach is that all the subshells are launched at once. There is no mechanism to control the maximum number of concurrent processes, so you are advised not to use it.
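To illustrate why the missing limit matters, here is a minimal sketch of what throttling would cost you in plain Bash: you have to count running jobs yourself. (This uses wait -n, which requires Bash 4.3 or newer; the sleep stands in for real work.)

```shell
max_jobs=2   # allow at most 2 concurrent background jobs
for i in 1 2 3 4; do
  # Block until fewer than $max_jobs background jobs are running
  while [ "$(jobs -r | wc -l)" -ge "$max_jobs" ]; do
    wait -n   # wait for any single job to finish (Bash 4.3+)
  done
  (sleep 0.2; echo "Processed $i") &
done
wait   # wait for the jobs that are still running
```

GNU Parallel, introduced below, takes care of this bookkeeping for you.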
$ while read i; do
> (slow.sh "$i") &
> done < data/movies.txt
[1] 3404
[2] 3405
[3] 3406
Starting job Star Wars
Starting job Matrix
Starting job Home Alone
[4] 3407
[5] 3410
$ Starting job Back to the Future
Starting job Indiana Jones
Job Home Alone took 2 seconds
Job Matrix took 2 seconds
Job Star Wars took 2 seconds
Job Back to the Future took 3 seconds
Job Indiana Jones took 4 seconds
Not everything can be parallelized: API calls may be limited to a certain number of concurrent requests, and some commands can only have one running instance at a time.
Quoting is important. If we did not quote $i, then only the first word of each movie would have been passed to the script slow.sh.
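A throwaway function makes the difference visible (the function name is just for illustration):

```shell
count_args() { echo "received $# argument(s), first: $1"; }
i="Home Alone"
count_args $i     # unquoted -> received 2 argument(s), first: Home
count_args "$i"   # quoted   -> received 1 argument(s), first: Home Alone
```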
There are two problems with this naive approach. First, there's no way to control how many processes you are running concurrently. Second, there's no logging, so it's hard to tell which output belongs to which input. GNU Parallel solves both problems:
$ < data/movies.txt parallel -j3 slow.sh "{}"
Starting job Star Wars
Job Star Wars took 3 seconds
Starting job Home Alone
Job Home Alone took 3 seconds
Starting job Matrix
Job Matrix took 4 seconds
Starting job Indiana Jones
Job Indiana Jones took 1 seconds
Starting job Back to the Future
Job Back to the Future took 5 seconds
8.3.1 Introducing GNU Parallel
GNU Parallel is a command-line tool written by Ole Tange. This tool allows us to parallelize commands and pipelines. The beauty of this tool is that existing tools can be used as they are; they do not need to be modified.
You may have noticed that we keep writing GNU Parallel. That’s because there are two tools with the name “parallel”. If you make use of the Data Science Toolbox then you already have the correct one installed. Otherwise, please double check that you have installed the correct tool by running parallel --version.
Before we go into the details of GNU Parallel, here’s a little teaser to show you how easy it is to parallelize the for loop stated above:
$ seq 5 | parallel "echo {}^2 | bc"
1
4
9
16
25
This is parallel in its simplest form: without any arguments. As you can see, it basically acts as a for loop. (We’ll explain later what exactly is going on.) With no fewer than 110 command-line arguments (!), GNU Parallel offers a lot of additional functionality. Don’t worry: by the end of this chapter, you’ll have a solid understanding of the most important ones.
Install GNU Parallel by running the following commands:
$ wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
$ tar -xvjf parallel-latest.tar.bz2 > extracted-files
$ cd $(head -n 1 extracted-files)
$ ./configure && make && sudo make install
You can verify that you have correctly installed GNU Parallel:
$ parallel --version | head -n 1
GNU parallel 20140622
You can safely delete the created files and directories.
$ cd ..
$ rm -r $(head -n 1 extracted-files)
$ rm parallel-latest.tar.bz2 extracted-files
If you use parallel as often as we do, then you may want to create an alias (for example, p) by adding alias p=parallel to your .bashrc. (In this chapter we’ll just use parallel for clarity.)
8.3.2 Specifying Input
The most important argument to GNU Parallel is the command that you would like to run for every input item. The question is: where should the input item be inserted in the command line? If you do not specify anything, then the input item will be appended to the command. While this is usually what you want, we advise you to be explicit about where the input item should be inserted, using one or more placeholders.
There are many ways to provide input to GNU Parallel. We prefer piping the input (as we do throughout this chapter) because that is generally applicable and allows us to construct a pipeline from left to right. Please consult the man page of parallel to read about other ways to provide input.
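One such alternative, for the record: input items can be passed on the command line itself after a ::: separator (this assumes GNU parallel is installed):

```shell
# Each item after ::: becomes one input item, as if it were piped in
parallel echo ::: A B C
```

This prints A, B, and C, one per line (though the order may vary with concurrency).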
In most cases, you probably want to use the entire input item as it is. For this, you only need one placeholder. You specify the placeholder, in other words, where to put the input item, with two curly braces:
$ seq 5 | parallel echo {}
When the input item is a file name, there are a couple of special placeholders you can use to modify it. For example, with {/}, only the base name of the file name will be used; with {.}, the extension is removed; and {/.} combines the two.
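GNU parallel's file-name modifiers ({.} strips the extension, {/} keeps the base name, {/.} does both) compute the same strings as Bash's own parameter expansions, which is a handy way to remember them; the file path below is purely illustrative:

```shell
file="data/ch08/report.txt"
echo "${file%.*}"    # like {.}  : extension removed -> data/ch08/report
echo "${file##*/}"   # like {/}  : base name         -> report.txt
base="${file##*/}"
echo "${base%.*}"    # like {/.} : base name, no ext -> report
```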
If the input line has multiple parts separated by a delimiter you can add numbers to the placeholders. For example:
$ < input.csv parallel -C, "mv {1} {2}"
Here, you can apply the same placeholder modifiers. It is also possible to reuse the same input item. If the input to parallel is a CSV file with a header, then you can use the column names as placeholders:
$ < input.csv parallel -C, --header : "invite {name} {email}"
Sometimes you just want to run the same command without any changing input. This is also possible with parallel. We just have to specify the -N0 parameter and give as input as many lines as the number of times we want the command executed:
$ seq 5 | parallel -N0 "echo The command line rules"
The command line rules
The command line rules
The command line rules
The command line rules
The command line rules
If you ever wonder whether your GNU Parallel command is set up correctly, you can add the --dryrun option. Instead of actually executing the commands, GNU Parallel prints out all the commands exactly as they would have been executed.
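For example, a dry run of the squaring teaser from earlier prints the commands instead of their results (assuming GNU parallel is installed):

```shell
seq 3 | parallel --dryrun "echo {}^2 | bc"
# prints the substituted command lines, e.g.:
#   echo 1^2 | bc
#   echo 2^2 | bc
#   echo 3^2 | bc
```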
8.3.3 Controlling the Number of Concurrent Jobs
By default, parallel runs one job per CPU core. You can control the number of jobs that will be run concurrently with the -j command-line argument, which is short for jobs. Simply specifying a number means that many jobs will be run in parallel. If you put a plus sign in front of the number, then parallel will run that many jobs in addition to the number of CPU cores; if you put a minus sign in front, it will run that many fewer. You can also pass a percentage of the number of CPU cores to the -j parameter, so the default is equivalent to -j100%. The optimal number of jobs to run in parallel depends on the actual commands you are running.
$ seq 5 | parallel -j0 "echo Hi {}"
Hi 1
Hi 2
Hi 3
Hi 4
Hi 5
$ seq 5 | parallel -j200% "echo Hi {}"
Hi 1
Hi 2
Hi 3
Hi 4
Hi 5
If you specify -j1, then the commands will be run serially. Even though this doesn’t do the name of the tool justice, it still has its uses, for example when you need to access an API that only allows one connection at a time. If you specify -j0, then parallel will run as many jobs in parallel as possible. This can be compared to our loop with subshells, and is not advised.
8.3.4 Logging and Output
To save the output of each command, you might be tempted to do the following:
$ seq 5 | parallel "echo \"Hi {}\" > data/ch08/hi-{}.txt"
This will save the output into individual files. Or, if you want to save everything into one big file you could do the following:
$ seq 5 | parallel "echo Hi {}" >> data/ch08/one-big-file.txt
However, GNU Parallel offers the --results option, which stores the output of each job in a separate file, where the file name is based on the input values:
$ seq 5 | parallel --results data/ch08/outdir "echo Hi {}"
Hi 1
Hi 2
Hi 3
Hi 4
Hi 5
$ find data/ch08/outdir
data/ch08/outdir
data/ch08/outdir/1
data/ch08/outdir/1/1
data/ch08/outdir/1/1/stderr
data/ch08/outdir/1/1/stdout
data/ch08/outdir/1/3
data/ch08/outdir/1/3/stderr
data/ch08/outdir/1/3/stdout
data/ch08/outdir/1/5
data/ch08/outdir/1/5/stderr
data/ch08/outdir/1/5/stdout
data/ch08/outdir/1/2
data/ch08/outdir/1/2/stderr
data/ch08/outdir/1/2/stdout
data/ch08/outdir/1/4
data/ch08/outdir/1/4/stderr
data/ch08/outdir/1/4/stdout
When you’re running multiple jobs in parallel, the order in which the jobs finish may not correspond to the order of the input, so their output is mixed up as well. To keep the output in the same order as the input, simply specify the --keep-order (or -k) option.
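A small demonstration, assuming GNU parallel and a sleep that accepts fractional seconds: the sleeps make later inputs finish first, yet -k restores the input order. (Note the single quotes, so that the shell does not evaluate the arithmetic before parallel substitutes {}.)

```shell
seq 3 | parallel -j3 -k 'sleep 0.$((4-{})); echo Hi {}'
# prints Hi 1, Hi 2, Hi 3 in input order, even though job 3 finishes first
```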
Sometimes it’s useful to record which input generated which output. GNU Parallel allows you to tag the output with the --tag option, which prepends each output line with its input item:
$ seq 5 | parallel --tag "echo Hi {}"
1 Hi 1
2 Hi 2
3 Hi 3
4 Hi 4
5 Hi 5
8.3.5 Creating Parallel Tools
The bc tool, which we used in the beginning of the chapter, is not parallel by itself. However, we can parallelize it using parallel. The Data Science Toolbox contains a tool called pbc (Janssens 2014d). Its code is shown in Example 8.2.
Example 8.2 (Parallel bc)
#!/usr/bin/env bash
parallel -C, -k -j100% "echo '$1' | bc -l"
This tool allows us to simplify the code used in the beginning of the chapter:
$ seq 100 | pbc '{1}^2' | tail
8281
8464
8649
8836
9025
9216
9409
9604
9801
10000