7.3 Computing Descriptive Statistics
7.3.1 csvstat
The command-line tool csvstat
gives a lot of information. For each feature (column), it shows:
- The data type in Python terminology (see Table 7-1 for a comparison between Python and SQL data types).
- Whether it has any missing values (nulls).
- The number of unique values.
- Various descriptive statistics (maximum, minimum, sum, mean, standard deviation, and median) for those features for which it is appropriate.We invoke
csvstat
as follows:
$ csvstat data/datatypes.csv
1. a
<type 'int'>
Nulls: False
Values: 2, 66, 42
2. b
<type 'float'>
Nulls: True
Values: 0.0, 3.1415
3. c
<type 'bool'>
Nulls: False
Unique values: 2
5 most frequent values:
False: 2
True: 1
4. d
<type 'unicode'>
Nulls: False
Values: 2198, "Yes!", Oh, good
5. e
<type 'datetime.datetime'>
Nulls: True
Values: 2011-11-11 11:00:00, 2014-09-15 00:00:00
6. f
<type 'datetime.date'>
Nulls: True
Values: 2012-09-08, 1970-12-06
7. g
<type 'datetime.time'>
Nulls: True
Values: 12:34:00, 12:07:00
Row count: 3
This gives a very verbose output. For a more concise output specify one of the statistics arguments:
—max
(maximum)—min
(minimum)—sum
(sum)—mean
(mean)—median
(median)—stdev
(standard deviation)—nulls
(whether column contains nulls)—unique
(unique values)—freq
(frequent values)—len
(max value length)
For example:
$ csvstat data/datatypes.csv --null
1. a: False
2. b: True
3. c: False
4. d: False
5. e: True
6. f: True
7. g: True
You can select a subset of features with the -c
command-line argument. This accepts both integers and column names:
$ csvstat data/investments2.csv -c 2,13,19,24
2. company_name
<type 'unicode'>
Nulls: True
Unique values: 27324
5 most frequent values:
Aviir: 13
Galectin Therapeutics: 12
Rostima: 12
Facebook: 11
Lending Club: 11
Max length: 66
13. investor_country_code
<type 'unicode'>
Nulls: True
Unique values: 111
5 most frequent values:
USA: 20806
GBR: 2357
DEU: 946
CAN: 893
FRA: 737
Max length: 15
19. funding_round_code
<type 'unicode'>
Nulls: True
Unique values: 15
5 most frequent values:
a: 7529
b: 4776
c: 2452
d: 1042
e: 384
Max length: 10
24. raised_amount_usd
<type 'int'>
Nulls: True
Min: 0
Max: 3200000000
Sum: 359891203117
Mean: 10370010.1748
Median: 3250000
Standard Deviation: 38513119.1802
Unique values: 6143
5 most frequent values:
10000000: 1159
1000000: 1074
5000000: 1066
2000000: 875
3000000: 820
Row count: 41799
Please note that csvstat
, just like csvsql
, employs heuristics to determine the data type, and therefore may not always get it right. We encourage you to always do a manual inspection as discussed in the previous subsection. Moreover, the type may be a character string or integer that doesn’t say anything about how it should be used.
As a nice extra, csvstat
outputs, at the very end, the number of data points (rows). Newlines and commas inside values are handles correctly. To only see the relevant line, we can use tail
:
$ csvstat data/iris.csv | tail -n 1
If you only want to see the actual number number of data points, you can use, for example, the following sed
expression to extract the number:
$ csvstat data/iris.csv | sed -rne '${s/^([^:]+): ([0-9]+)$/\2/;p}'
7.3.2 Using R from the Command Line using Rio
In this section we would like to introduce you to a command-line tool called Rio
, which is essentially a small, nifty wrapper around the statistical programming environment R. Before we explain what Rio does and why it exists, lets talk a bit about R itself.
R is a very powerful statistical software package to analyze data and create visualizations. It’s an interpreted programming language, has an extensive collection of packages, and offers its own REPL (Read-Eval-Print-Loop), which allows you, similar to the command line, to play with your data. Unfortunately, R is quite separated from the command line. Once you start it, you’re in a separate environment. R doesn’t really play well with the command line because you cannot pipe any data into it and it also doesn’t support any one-liners that you can specify.
For example, imagine that you have a CSV file called tips.csv, and you would like compute the tip percentage, and save the result. To accomplish this in R you would first startup R:
$ #? [echo]
$ R
And then run the following commands:
> tips <- read.csv('tips.csv', header = T, sep = ',', stringsAsFactors = F)
> tips.percent <- tips$tip / tips$bill * 100
> cat(tips.percent, sep = '\n', file = 'percent.csv')
> q("no")
Afterwards, you can continue with the saved file percent.csv on the command line. Note that there is only one command that is associated with what we want to accomplish specifically. The other commands are necessary boilerplate. Typing in this boilerplate in order to accomplish something simple is cumbersome and breaks your workflow. Sometimes, you only want to do one or two things at a time to your data. Wouldn’t it be great if we could harness the power of R and be able to use it from the command line?
This is where Rio
comes in. The name Rio stands for R input/output, because it enables you to use R as a filter on the command line. You simply pipe CSV data into Rio
and you specify the R commands that you want to run on it. Let’s perform the same task as before, but now using Rio
:
$ < data/tips.csv Rio -e 'df$tip / df$bill * 100' | head -n 10
Rio
can execute multiple R command that are separated by semicolons. So, if you wanted to add a column called percent to the input data, you could do the following:
$ < data/tips.csv Rio -e 'df$percent <- df$tip / df$bill * 100; df' | head
These small one-liners are possible because Rio
takes care of all the boilerplate. Being able to use the command line for this and capture the power of R into a one-liner is fantastic, especially if you want to keep on working on the command line. Rio
assumes that the input data is in CSV format with a header. (By specifying the -n
command-line argument Rio
does not consider the first row to be the header and creates default column names.) Behind the scenes, Rio
writes the piped data to a temporary CSV file and creates a script that:
- Import required libraries.
- Loads the CSV file as a data frame.
- Generates a
ggplot2
object if needed (more on this in the next section). Runs the specified commands.
Prints the result of the last command to standard output.
So now, if you wanted to do one or two things to your data set with R, you can specify it as a one-liner, and keep on working on the command line. All the knowledge that you already have about R can now be used from the command line. With Rio
, you can even create sophisticated visualizations, as you will see later in this chapter.
Rio doesn’t have to be used as a filter, meaning the output doesn’t have to be a in CSV format per se. You can compute
$ < data/iris.csv Rio -e 'mean(df$sepal_length)'
$ < data/iris.csv Rio -e 'sd(df$sepal_length)'
$ < data/iris.csv Rio -e 'sum(df$sepal_length)'
If we wanted to compute the five summary statistics, we would do:
$ < iris.csv Rio -e 'summary(df$sepal_length)'
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
You can also compute the skewness (symmetry of the distribution) and kurtosis (peakedness of the distribution), but then you need to have the moments
package installed:
$ #? [echo]
$ < data/iris.csv Rio -e 'skewness(df$sepal_length)'
$ < data/iris.csv Rio -e 'kurtosis(df$petal_width)'
Correlation between two features:
$ < tips.csv Rio -e 'cor(df$bill, df$tip)'
0.6757341
Or a correlation matrix:
$ < data/tips.csv csvcut -c bill,tip | Rio -f cor | csvlook
|--------------------+--------------------|
| bill | tip |
|--------------------+--------------------|
| 1 | 0.675734109211365 |
| 0.675734109211365 | 1 |
|--------------------+--------------------|
Note that with the command-line argument -f
, we can specify the function to apply to the data frame df
. In this case, it is the same as -e cor(df)
.
You can even create a stem plot (Tukey 1977) using Rio
:
$ < data/iris.csv Rio -e 'stem(df$sepal_length)'