Chapter 8 Parallel Pipelines - 8.5 Discussion - 《Data Science at the Command Line》

8.5 Discussion

As data scientists, we work with data, and sometimes a lot of it. This means that we sometimes need to run a command multiple times or distribute data-intensive commands over multiple cores. In this chapter we have shown you how easy it is to parallelize commands. GNU Parallel is a powerful and flexible tool that speeds up ordinary command-line tools by distributing them over multiple cores and remote machines. It offers a lot of functionality, and in this chapter we've only been able to scratch the surface. Some features of GNU Parallel that we haven't covered are:

Different ways of specifying input.
Keep a log of all the jobs.
Only start new jobs when the machine is under a certain load.
Timeout, resume, and retry jobs.

Once you have a basic understanding of GNU Parallel and its most important options, we recommend that you take a look at its tutorial listed in the Further Reading section.
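To give a rough idea of what the features listed above look like in practice, here is a minimal sketch. It assumes GNU Parallel is installed as `parallel`. The file names `numbers.txt` and `jobs.log`, the temporary directory, and the `sleep` durations are made up for illustration. Treat it as a starting point for exploring the tutorial rather than a complete reference.

```shell
# Work in a throwaway directory (hypothetical; any writable directory will do).
workdir=$(mktemp -d)
cd "$workdir"

# Different ways of specifying input: arguments after :::, lines of a
# file after ::::, or lines read from standard input.
seq 3 > numbers.txt
parallel echo "arg {}" ::: 1 2 3
parallel echo "file {}" :::: numbers.txt
seq 3 | parallel echo "stdin {}"

# Keep a log of all the jobs: --joblog writes a header line plus one line
# per job, including its runtime, exit value, and command.
parallel --joblog jobs.log echo "job {}" ::: a b c

# Resume: with the same job log, --resume skips the jobs that are already
# logged, so only the newly added argument "d" is run here.
parallel --resume --joblog jobs.log echo "job {}" ::: a b c d

# Timeout and retry: give up on a job after 1 second and try a failing job
# at most 2 times. The job that sleeps for 3 seconds never finishes, so
# parallel exits with a non-zero status.
parallel --timeout 1 --retries 2 "sleep {}; echo slept {}" ::: 0 3

# Only start new jobs when the machine is under a certain load. This line is
# left commented out because it waits for as long as the machine is busy:
# parallel --load 100% echo "load ok {}" ::: 1 2 3
```

After running this, `jobs.log` contains one header line and four job lines, one for each of the arguments `a` through `d`, which is a convenient way to check which jobs ran and how long they took.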