Chapter 6 Managing Your Data Workflow - 6.4 Obtain Top E-books from Project Gutenberg - 《Data Science at the Command Line》

6.4 Obtain Top E-books from Project Gutenberg

For the remainder of this chapter, we’ll use the following task as a running example. Our goal is to turn the command that solves this task into a Drake workflow. We start with a simple workflow and gradually work towards an advanced one, introducing the concepts and syntax of Drake along the way.

Project Gutenberg is an ambitious project that, since 1971, has archived and digitized over 42,000 books and offers these as free e-books. Its website lists the hundred most downloaded books. Let’s assume that we are interested in the top five downloads of Project Gutenberg. Because this list is available as HTML, it is straightforward to obtain the top five downloads:

$ curl -s 'http://www.gutenberg.org/browse/scores/top' |
> grep -E '^<li>' |
> head -n 5 |
> sed -E "s/.*ebooks\/([0-9]+).*/\\1/" > data/top-5

This command:

Downloads the HTML.
Extracts the list items.
Keeps only the top five items.
Saves e-book IDs to data/top-5.

The output of the command is:

$ cat data/top-5
1342
76
11
1661
1952
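
To see what the sed expression does in isolation, the following sketch applies it to two made-up list items. The HTML lines are only an assumption about the shape of the page's markup, and the numbers in parentheses are invented; only the e-book IDs correspond to real Project Gutenberg entries.

```shell
# Hypothetical list items, shaped like the lines the grep step is assumed to keep.
sample='<li><a href="/ebooks/1342">Pride and Prejudice by Jane Austen (100)</a></li>
<li><a href="/ebooks/76">Adventures of Huckleberry Finn by Mark Twain (90)</a></li>'

# Replace each whole line with just the digits that follow "ebooks/".
ids=$(printf '%s\n' "$sample" | sed -E 's/.*ebooks\/([0-9]+).*/\1/')
printf '%s\n' "$ids"
```

Because the pattern starts and ends with `.*`, the substitution consumes the entire line and leaves only the captured group, which is why each output line holds a bare ID.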

If you want to be able to reproduce this at a later time, the easiest thing you can do is put this command in a script, as we’ve seen in Chapter 4. However, every time you execute that script, the HTML is downloaded again. There are three common reasons why you might want to control whether certain steps are run. First, a step may take a very long time. Second, you may want to continue working with the same data. Third, the data may come from an API that imposes rate limits. It is therefore a good idea to let one step save the data to a file and let subsequent steps operate on that file, so that you avoid redundant computations and API calls. The first reason is not really a problem in our example, because the HTML downloads quickly. In other cases, however, the data may come from other sources and comprise gigabytes.
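
Before turning to Drake, the idea of separating the download from the processing can be sketched in plain shell. This is a minimal illustration and not the workflow the chapter builds: the function names `fetch_top` and `extract_top` and the file name `data/top.html` are hypothetical choices made for this example.

```shell
# Step 1: download the page only when no non-empty local copy exists yet.
# $1 is the path of the cached HTML file (hypothetical, e.g. data/top.html).
fetch_top() {
  mkdir -p "$(dirname "$1")"
  if [ ! -s "$1" ]; then
    curl -s 'http://www.gutenberg.org/browse/scores/top' > "$1"
  fi
}

# Step 2: operate on the saved file, so re-running this step never
# touches the network. $1 is the cached HTML file, $2 the number of IDs.
extract_top() {
  grep -E '^<li>' "$1" | head -n "$2" | sed -E 's/.*ebooks\/([0-9]+).*/\1/'
}
```

With these definitions, `fetch_top data/top.html && extract_top data/top.html 5 > data/top-5` reproduces the earlier result, and a second run reuses the saved HTML. The check is deliberately crude: it never notices that the page has changed, which is one of the gaps a workflow tool is meant to fill.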