Chapter 6 Managing Your Data Workflow - 6.6 Well, That Depends - 《Data Science at the Command Line》

6.6 Well, That Depends

Our workflow contains just a single step, which means that, just as with a simple Bash script, everything is executed every time it runs. So the first thing we are going to do is split this single step into two: the first step downloads the HTML, and the second step processes it. The second step obviously depends on the first, and we can define this dependency in our workflow.

You may have noticed that the number 5 is specified three times. If we ever wanted to get, say, the top 10 e-books from Project Gutenberg, we would have to change our workflow in three places. This is inefficient and needs to be addressed. Luckily, Drake supports variables.

It may not be immediately obvious from our workflow, but our data resides in the same location as the script. It is better to have the data live in a separate location, apart from the code that generates it. Not only does this keep our project cleaner, it also makes it easier to delete the generated data files, and we can easily specify that we do not want the data files to be included in a version control system such as git (Torvalds and Hamano 2014). Let’s have a look:

NUM:=5
BASE=data/
top.html <- [-timecheck]
    curl -s 'http://www.gutenberg.org/browse/scores/top' > $OUTPUT
top-$[NUM] <- top.html
    < $INPUT grep -E '^<li>' |
    head -n $[NUM] |
    sed -E "s/.*ebooks\/([0-9]+)\">([^<]+)<.*/\\1,\\2/" > $OUTPUT

You can specify variables in Drake, preferably at the beginning of the file, by writing the variable name, then an equal sign, and then the value. The name of a variable doesn’t have to be in all capitals, but capitals do make it stand out more. As you can see, for the variable NUM we have used := instead of =. This means that if the variable NUM is already set, it will not be overridden. This allows us to specify the value of NUM from the command line before we run Drake.
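The effect of := is similar in spirit to the shell's own "assign a default" parameter expansion. The following is a minimal Bash sketch of that analogy, not a description of how Drake implements it; the function name resolve_num is made up for illustration. One difference to keep in mind: Bash also applies the default when the variable is set but empty.

```shell
#!/usr/bin/env bash
# Analogy only: ${NUM:=5} assigns 5 when NUM is unset or empty and
# otherwise leaves the caller's value alone, much like NUM:=5 in the workflow.

# resolve_num is a hypothetical helper that prints the output name a
# workflow step would use for the current value of NUM.
resolve_num() {
    : "${NUM:=5}"          # keep an existing NUM, otherwise fall back to 5
    echo "top-${NUM}"
}

# No NUM set by the caller: the default applies.
(unset NUM; resolve_num)   # prints top-5

# NUM set by the caller: the default is ignored.
(NUM=10; resolve_num)      # prints top-10
```

Each call runs in a subshell so that the assignment made by one call does not leak into the next.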
The BASE variable is a special variable. Drake will treat every file specified in the workflow as if it were in this base directory.
We now have two steps. The first step has the same input as before, but now the output is a different file, namely, top.html. This output is defined again as the input of step two. This is how Drake knows that the second step depends on the first step.
We have used two more special variables: INPUT and OUTPUT. Values of these two special variables are set to what we have defined as the input and output of that step, respectively. This way, we don’t have to specify the input and output of a certain step twice. Furthermore, it allows us to easily reuse certain steps in any future workflows.

Let’s execute this new workflow using Drake:

$ drake -w 02.drake
The following steps will be run, in order:
  1: ../../data/top.html <-  [missing output]
  2: ../../data/top-5 <- ../../data/top.html [projected timestamped]
Confirm? [y/n] y
Running 2 steps with concurrence of 1...
--- 0. Running (missing output): ../../data/top.html <-
--- 0: ../../data/top.html <-  -> done in 0.89s
--- 1. Running (missing output): ../../data/top-5 <- ../../data/top.html
--- 1: ../../data/top-5 <- ../../data/top.html -> done in 0.02s
Done (2 steps run).

Now, let’s assume that instead of the top 5 e-books, we want the top 10. We can set the NUM variable from the command line and run Drake again:

$ NUM=10 drake -w 02.drake
The following steps will be run, in order:
  1: ../../data/top-10 <- ../../data/top.html [missing output]
Confirm? [y/n] y
Running 1 steps with concurrence of 1...
--- 1. Running (missing output): ../../data/top-10 <- ../../data/top.html
--- 1: ../../data/top-10 <- ../../data/top.html -> done in 0.02s
Done (1 steps run).

As you can see, Drake now only needs to execute the second step, because the output of the first step has already been satisfied. Again, downloading an HTML file is not such a big deal, but can you imagine the implications if you were dealing with 10 GB worth of data?
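To make the idea of a step being "already satisfied" more concrete, here is a rough Bash sketch of the kind of check a workflow tool can perform before running a step. It illustrates the general technique (run when the output is missing or older than its input) and is not Drake's actual logic; the function name needs_run and the temporary directory are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch of a timestamp-based dependency check. This shows the general
# idea only and is an assumption, not Drake's real implementation.

# needs_run OUTPUT INPUT: succeeds (exit status 0) when the step should run.
needs_run() {
    local output=$1 input=$2
    [ ! -e "$output" ] && return 0            # output is missing
    [ "$input" -nt "$output" ] && return 0    # input is newer than output
    return 1                                  # output is up to date
}

# Work in a scratch directory so that no real data is touched.
workdir=$(mktemp -d)
touch "$workdir/top.html"

# top-5 does not exist yet, so the step has to run.
if needs_run "$workdir/top-5" "$workdir/top.html"; then
    echo "run: top-5 is missing"
fi

# After the output has been created, the same check says the step can be skipped.
touch "$workdir/top-5"
if ! needs_run "$workdir/top-5" "$workdir/top.html"; then
    echo "skip: top-5 is up to date"
fi
```

With a check like this, asking for top-10 only triggers the second step: top.html already exists, while top-10 does not.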