Chapter 7 Exploring Data - 7.3 Computing Descriptive Statistics - 《Data Science at the Command Line》

7.3 Computing Descriptive Statistics
- 7.3.1 csvstat
- 7.3.2 Using R from the Command Line using Rio

7.3 Computing Descriptive Statistics

7.3.1 csvstat

The command-line tool csvstat gives a lot of information. For each feature (column), it shows:

The data type in Python terminology (see Table 7-1 for a comparison between Python and SQL data types).
Whether it has any missing values (nulls).
The number of unique values.
Various descriptive statistics (maximum, minimum, sum, mean, standard deviation, and median) for those features for which it is appropriate.We invoke csvstat as follows:

$ csvstat data/datatypes.csv
  1. a
        <type 'int'>
        Nulls: False
        Values: 2, 66, 42
  2. b
        <type 'float'>
        Nulls: True
        Values: 0.0, 3.1415
  3. c
        <type 'bool'>
        Nulls: False
        Unique values: 2
        5 most frequent values:
                False:  2
                True:   1
  4. d
        <type 'unicode'>
        Nulls: False
        Values: 2198, "Yes!", Oh, good
  5. e
        <type 'datetime.datetime'>
        Nulls: True
        Values: 2011-11-11 11:00:00, 2014-09-15 00:00:00
  6. f
        <type 'datetime.date'>
        Nulls: True
        Values: 2012-09-08, 1970-12-06
  7. g
        <type 'datetime.time'>
        Nulls: True
        Values: 12:34:00, 12:07:00
Row count: 3

This gives a very verbose output. For a more concise output specify one of the statistics arguments:

—max (maximum)
—min (minimum)
—sum (sum)
—mean (mean)
—median (median)
—stdev (standard deviation)
—nulls (whether column contains nulls)
—unique (unique values)
—freq (frequent values)
—len (max value length)

For example:

$ csvstat data/datatypes.csv --null
  1. a: False
  2. b: True
  3. c: False
  4. d: False
  5. e: True
  6. f: True
  7. g: True

You can select a subset of features with the -c command-line argument. This accepts both integers and column names:

$ csvstat data/investments2.csv -c 2,13,19,24
  2. company_name
        <type 'unicode'>
        Nulls: True
        Unique values: 27324
        5 most frequent values:
                Aviir:  13
                Galectin Therapeutics:  12
                Rostima:        12
                Facebook:       11
                Lending Club:   11
        Max length: 66
 13. investor_country_code
        <type 'unicode'>
        Nulls: True
        Unique values: 111
        5 most frequent values:
                USA:    20806
                GBR:    2357
                DEU:    946
                CAN:    893
                FRA:    737
        Max length: 15
 19. funding_round_code
        <type 'unicode'>
        Nulls: True
        Unique values: 15
        5 most frequent values:
                a:      7529
                b:      4776
                c:      2452
                d:      1042
                e:      384
        Max length: 10
 24. raised_amount_usd
        <type 'int'>
        Nulls: True
        Min: 0
        Max: 3200000000
        Sum: 359891203117
        Mean: 10370010.1748
        Median: 3250000
        Standard Deviation: 38513119.1802
        Unique values: 6143
        5 most frequent values:
                10000000:       1159
                1000000:        1074
                5000000:        1066
                2000000:        875
                3000000:        820
Row count: 41799

Please note that csvstat, just like csvsql, employs heuristics to determine the data type, and therefore may not always get it right. We encourage you to always do a manual inspection as discussed in the previous subsection. Moreover, the type may be a character string or integer that doesn’t say anything about how it should be used.

As a nice extra, csvstat outputs, at the very end, the number of data points (rows). Newlines and commas inside values are handles correctly. To only see the relevant line, we can use tail:

$ csvstat data/iris.csv | tail -n 1

If you only want to see the actual number number of data points, you can use, for example, the following sed expression to extract the number:

$ csvstat data/iris.csv | sed -rne '${s/^([^:]+): ([0-9]+)$/\2/;p}'

7.3.2 Using R from the Command Line using Rio

In this section we would like to introduce you to a command-line tool called Rio, which is essentially a small, nifty wrapper around the statistical programming environment R. Before we explain what Rio does and why it exists, lets talk a bit about R itself.

R is a very powerful statistical software package to analyze data and create visualizations. It’s an interpreted programming language, has an extensive collection of packages, and offers its own REPL (Read-Eval-Print-Loop), which allows you, similar to the command line, to play with your data. Unfortunately, R is quite separated from the command line. Once you start it, you’re in a separate environment. R doesn’t really play well with the command line because you cannot pipe any data into it and it also doesn’t support any one-liners that you can specify.

For example, imagine that you have a CSV file called tips.csv, and you would like compute the tip percentage, and save the result. To accomplish this in R you would first startup R:

$ #? [echo]
$ R

And then run the following commands:

> tips <- read.csv('tips.csv', header = T, sep = ',', stringsAsFactors = F)
> tips.percent <- tips$tip / tips$bill * 100
> cat(tips.percent, sep = '\n', file = 'percent.csv')
> q("no")

Afterwards, you can continue with the saved file percent.csv on the command line. Note that there is only one command that is associated with what we want to accomplish specifically. The other commands are necessary boilerplate. Typing in this boilerplate in order to accomplish something simple is cumbersome and breaks your workflow. Sometimes, you only want to do one or two things at a time to your data. Wouldn’t it be great if we could harness the power of R and be able to use it from the command line?

This is where Rio comes in. The name Rio stands for R input/output, because it enables you to use R as a filter on the command line. You simply pipe CSV data into Rio and you specify the R commands that you want to run on it. Let’s perform the same task as before, but now using Rio:

$ < data/tips.csv Rio -e 'df$tip / df$bill * 100' | head -n 10

Rio can execute multiple R command that are separated by semicolons. So, if you wanted to add a column called percent to the input data, you could do the following:

$ < data/tips.csv Rio -e 'df$percent <- df$tip / df$bill * 100; df' | head

These small one-liners are possible because Rio takes care of all the boilerplate. Being able to use the command line for this and capture the power of R into a one-liner is fantastic, especially if you want to keep on working on the command line. Rio assumes that the input data is in CSV format with a header. (By specifying the -n command-line argument Rio does not consider the first row to be the header and creates default column names.) Behind the scenes, Rio writes the piped data to a temporary CSV file and creates a script that:

Import required libraries.
Loads the CSV file as a data frame.
Generates a ggplot2 object if needed (more on this in the next section).
Runs the specified commands.
Prints the result of the last command to standard output.

So now, if you wanted to do one or two things to your data set with R, you can specify it as a one-liner, and keep on working on the command line. All the knowledge that you already have about R can now be used from the command line. With Rio, you can even create sophisticated visualizations, as you will see later in this chapter.

Rio doesn’t have to be used as a filter, meaning the output doesn’t have to be a in CSV format per se. You can compute

$ < data/iris.csv Rio -e 'mean(df$sepal_length)'
$ < data/iris.csv Rio -e 'sd(df$sepal_length)'
$ < data/iris.csv Rio -e 'sum(df$sepal_length)'

If we wanted to compute the five summary statistics, we would do:

$ < iris.csv Rio -e 'summary(df$sepal_length)'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   5.100   5.800   5.843   6.400   7.900

You can also compute the skewness (symmetry of the distribution) and kurtosis (peakedness of the distribution), but then you need to have the moments package installed:

$ #? [echo]
$ < data/iris.csv Rio -e 'skewness(df$sepal_length)'
$ < data/iris.csv Rio -e 'kurtosis(df$petal_width)'

Correlation between two features:

$ < tips.csv Rio -e 'cor(df$bill, df$tip)'
0.6757341

Or a correlation matrix:

$ < data/tips.csv csvcut -c bill,tip | Rio -f cor | csvlook
|--------------------+--------------------|
|  bill              | tip                |
|--------------------+--------------------|
|  1                 | 0.675734109211365  |
|  0.675734109211365 | 1                  |
|--------------------+--------------------|

Note that with the command-line argument -f, we can specify the function to apply to the data frame df. In this case, it is the same as -e cor(df).

You can even create a stem plot (Tukey 1977) using Rio:

$ < data/iris.csv Rio -e 'stem(df$sepal_length)'