1.6 A Real-world Use Case
In the previous sections, we’ve given you a definition of data science and explained to you why the command line can be a great environment for doing data science. Now it’s time to demonstrate the power and flexibility of the command line through a real-world use case. We’ll go pretty fast, so don’t worry if some things don’t make sense yet.
Personally, we never seem to remember when Fashion Week is happening in New York. We know it’s held twice a year, but every time it comes as a surprise! In this section we’ll consult the wonderful API of The New York Times to figure out when it’s being held. Once you have obtained your own API keys on the developer website, you’ll be able to, for example, search for articles, get the list of best sellers, and see a list of events.
The particular API endpoint that we’re going to query is the article search one. We expect that a spike in the amount of coverage in The New York Times about New York Fashion Week indicates that it’s happening. The results from the API are paginated, which means that we have to execute the same query multiple times but with different page numbers. (It’s like clicking Next on a search engine.) This is where GNU Parallel (Tange 2014) comes in real handy because it can act as a for loop. The entire command looks as follows (don’t worry about all the command-line arguments given to parallel; we’re going to discuss them in great detail in Chapter 8):
$ cd ~/book/ch01/data
$ parallel -j1 --progress --delay 0.1 --results results "curl -sL "\
> "'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=New+York+'"\
> "'Fashion+Week&begin_date={1}0101&end_date={1}1231&page={2}&api-key='"\
> "'<your-api-key>'" ::: {2009..2013} ::: {0..99} > /dev/null
Computers / CPU cores / Max jobs to run
1:local / 4 / 1
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:1/9/100%/0.4s
Basically, we’re performing the same query for the years 2009 through 2013. The API only allows up to 100 pages (starting at 0) per query, so we’re generating 100 numbers using brace expansion. These numbers are used by the page parameter in the query. We’re searching for articles that contain the search term New+York+Fashion+Week. Because the API has certain limits, we ensure that there’s only one request at a time (-j1), with a 0.1-second delay between them (--delay 0.1). Make sure that you replace <your-api-key> with your own API key for the article search endpoint.
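If you’re curious about the values that parallel substitutes for the {1} and {2} placeholders, you can ask the shell directly. Here’s a minimal sketch using echo (no API calls are made), just to show what the two brace expansions generate:

$ echo {2009..2013}
2009 2010 2011 2012 2013
$ echo {0..99} | wc -w
100

Every combination of a year and a page number becomes one job, so parallel ends up making 5 × 100 = 500 requests.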
Each request returns up to 10 articles, so that’s 5,000 articles in total. These are sorted by page views, so this should give us a good estimate of the coverage. The results are in JSON format, which we store in the results directory. The command-line tool tree (Baker 2014) gives an overview of how the subdirectories are structured:
$ tree results | head
results
└── 1
    ├── 2009
    │   └── 2
    │       ├── 0
    │       │   ├── stderr
    │       │   └── stdout
    │       ├── 1
    │       │   ├── stderr
    │       │   └── stdout
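Each stdout file holds the raw JSON response of one request (the numbered directories correspond to the two input sources: the year and the page number). If you want to peek inside a single response, you can point jq at one of these files. As a quick sanity check, and assuming the response follows the documented response.meta.hits field of the Article Search API, the following prints the total number of matching articles for that year (the exact number depends on when you run the query):

$ jq '.response.meta.hits' results/1/2009/2/0/stdout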
We can combine and process the results using cat (Granlund and Stallman 2012a), jq (Dolan 2014), and json2csv (Czebotar 2014):
$ cat results/1/*/2/*/stdout |
> jq -c '.response.docs[] | {date: .pub_date, type: .document_type, '\
> 'title: .headline.main }' | json2csv -p -k date,type,title > fashion.csv
Let’s break down this command:

- We combine the output of each of the 500 parallel jobs (or API requests).
- We use jq to extract the publication date, the document type, and the headline of each article.
- We convert the JSON data to CSV using json2csv and store it as fashion.csv.

With wc -l (Rubin and MacKenzie 2012), we find out that this data set contains 4,855 articles plus a header line (and not 5,000 articles, presumably because for 2009 we had already retrieved everything before reaching the last page):
$ wc -l fashion.csv
4856 fashion.csv
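If the jq filter above looks cryptic, here is a minimal sketch of what it does, applied to a single hand-crafted document (the field names match the ones used in the pipeline; the values are borrowed from the first article in the table below, with a made-up timestamp):

$ echo '{"response": {"docs": [{"pub_date": "2009-02-15T12:00:00Z",
>   "document_type": "multimedia", "headline": {"main": "Michael Kors"}}]}}' |
> jq -c '.response.docs[] | {date: .pub_date, type: .document_type, title: .headline.main}'
{"date":"2009-02-15T12:00:00Z","type":"multimedia","title":"Michael Kors"}

The real pipeline applies this filter to every document in every stdout file, and json2csv then turns each resulting JSON line into one row of the CSV.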
Let’s inspect the first few articles to verify that we have succeeded in obtaining the data. Note that we’re applying cols (Janssens 2014b) and cut (Ihnat, MacKenzie, and Meyering 2012) to the date column in order to leave out the time and timezone information in the table:
$ < fashion.csv cols -c date cut -dT -f1 | head | csvlook
|-------------+------------+-----------------------------------------|
| date | type | title |
|-------------+------------+-----------------------------------------|
| 2009-02-15 | multimedia | Michael Kors |
| 2009-02-20 | multimedia | Recap: Fall Fashion Week, New York |
| 2009-09-17 | multimedia | UrbanEye: Backstage at Marc Jacobs |
| 2009-02-16 | multimedia | Bill Cunningham on N.Y. Fashion Week |
| 2009-02-12 | multimedia | Alexander Wang |
| 2009-09-17 | multimedia | Fashion Week Spring 2010 |
| 2009-09-11 | multimedia | Of Color | Diversity Beyond the Runway |
| 2009-09-14 | multimedia | A Designer Reinvents Himself |
| 2009-09-12 | multimedia | On the Street | Catwalk |
|-------------+------------+-----------------------------------------|
That seems to have worked! In order to gain any insight, we’d better visualize the data. Figure 1.3 contains a line graph created with R (R Foundation for Statistical Computing 2014), Rio (Janssens 2014e), and ggplot2 (Wickham 2009).
$ < fashion.csv Rio -ge 'g + geom_freqpoly(aes(as.Date(date), color=type), '\
> 'binwidth=7) + scale_x_date() + labs(x="date", title="Coverage of New York'\
> ' Fashion Week in New York Times")' | display
Figure 1.3: Coverage of New York Fashion Week in the New York Times
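If you don’t have Rio installed, the one-liner above corresponds roughly to the following standalone sketch. It is only our approximation of what Rio does with the -g and -e options (read the CSV from standard input into a data frame df, load ggplot2, and bind g to ggplot(df)); here we read fashion.csv directly and save the plot to a file instead of piping it to display:

$ R --slave --no-save << 'EOF'
> library(ggplot2)
> df <- read.csv("fashion.csv")
> g <- ggplot(df)
> p <- g + geom_freqpoly(aes(as.Date(date), color = type), binwidth = 7) +
>   scale_x_date() +
>   labs(x = "date", title = "Coverage of New York Fashion Week in New York Times")
> ggsave("fashion.png", p)
> EOF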
By looking at the line graph we can infer that New York Fashion Week happens twice per year. And now we know when: once in February and once in September. Let’s hope that it’s going to be the same this year so that we can prepare ourselves! In any case, we hope that with this example, we’ve shown that The New York Times API is an interesting source of data. More importantly, we hope that we’ve convinced you that the command line can be a very powerful approach for doing data science.
In this section we’ve peeked at some important concepts and some exciting command-line tools. Don’t worry if some things don’t make sense yet. Most of the concepts will be discussed in Chapter 2, and in the subsequent chapters we’ll go into more detail for all the command-line tools used in this section.