7.2 Inspecting Data and its Properties
In this section we’ll demonstrate how to inspect your data set and its properties. Because the upcoming visualization and modeling techniques expect the data to be in tabular format, we’ll assume that the data is in CSV format. You can use the techniques described in Chapter 5 to convert your data to CSV if necessary.
For simplicity's sake, we'll also assume that your data has a header. In the first subsection we are going to determine whether that is the case. Once we know we have a header, we can continue answering the following questions:
- How many data points and features does the data set have?
- What does the raw data look like?
- What kind of features does the data set have?
- Can some of these features be treated as categorical or as factors?
7.2.1 Header Or Not, Here I Come
You can check whether your file has a header by printing the first few lines:
$ head file.csv | csvlook
It is then up to you to decide whether the first line is indeed a header or already the first data point. When the data set contains no header, or when its header contains newlines, you're best off going back to Chapter 5 and correcting that.
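When the header turns out to be missing, one way to repair it is to prepend a header line yourself. Below is a minimal sketch using GNU sed's one-line insert syntax; the file name and column names are made-up examples, not inferred from any real data:

```shell
# Create a small headerless CSV for demonstration (hypothetical data)
printf '5.1,3.5,setosa\n4.9,3.0,setosa\n' > /tmp/noheader.csv

# Prepend a header line (GNU sed syntax); the column names are assumptions
sed '1i sepal_length,sepal_width,species' /tmp/noheader.csv > /tmp/withheader.csv

head -n 1 /tmp/withheader.csv
```

On BSD or macOS sed, the insert command requires a backslash and a literal newline after `1i\` instead.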
7.2.2 Inspect All The Data
If you want to inspect the raw data, then it's best not to use the cat command-line tool, since cat prints all the data to the screen in one go. In order to inspect the raw data at your own pace, we recommend using less (Nudelman 2013) with the -S command-line argument:
$ less -S file.csv
The -S command-line argument ensures that long lines are not wrapped when they don't fit in the terminal. Instead, less allows you to scroll horizontally to see the rest of the lines. The advantage of less is that it does not load the entire file into memory, which is good for viewing large files. Once you're in less, you can scroll down a full screen by pressing <Space>. Scrolling horizontally is done by pressing <Left> and <Right>. Press g and G to go to the start and the end of the file, respectively. Quitting less is done by pressing q. Read the man page for more key bindings.
If you want the data set to be nicely formatted, you can add in csvlook:
$ < file.csv csvlook | less -S
Unfortunately, csvlook needs to read the entire file into memory in order to determine the width of the columns. So, when you want to inspect a very large file, you may want to take a subset (using sample, for example), or you may need to be patient.
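If the sample tool is not installed, you can approximate the same idea with standard tools. The sketch below keeps the header and randomly draws a few data rows using GNU shuf; the file and its contents are made-up examples:

```shell
# A small CSV standing in for a large file (hypothetical data)
printf 'a,b\n1,2\n3,4\n5,6\n7,8\n' > /tmp/large.csv

# Keep the header, then randomly sample 2 data rows (GNU shuf)
{ head -n 1 /tmp/large.csv; tail -n +2 /tmp/large.csv | shuf -n 2; } > /tmp/subset.csv

wc -l < /tmp/subset.csv
```

The resulting subset can then be piped through csvlook as before.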
7.2.3 Feature Names and Data Types
In order to gain insight into the data set, it is useful to print the feature names and study them. After all, the feature names may indicate the meaning of the feature. You can use the following sed expression for this:
$ < data/iris.csv sed -e 's/,/\n/g;q'
Note that this basic command assumes that the file is delimited by commas. Just as a reminder: if you intend to use this command often, you could define a function in your .bashrc file called, say, names:
Example 7.1
names () { sed -e 's/,/\n/g;q'; }
You can then use it like this:
$ < data/investments.csv names
company_permalink
company_name
company_category_list
company_market
company_country_code
company_state_code
company_region
company_city
investor_permalink
investor_name
investor_category_list
investor_market
investor_country_code
investor_state_code
investor_region
investor_city
funding_round_permalink
funding_round_type
funding_round_code
funded_at
funded_month
funded_quarter
funded_year
raised_amount_usd
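The names function above assumes a comma delimiter. A slightly more flexible sketch takes the delimiter as an optional second argument; the function name, example file, and column names here are assumptions for illustration:

```shell
# Print one column name per line; the delimiter defaults to a comma
names() { head -n 1 "$1" | tr "${2:-,}" '\n'; }

# Hypothetical semicolon-delimited file
printf 'id;name;score\n1;alice;9\n' > /tmp/cols.csv

names /tmp/cols.csv ';'
```

Using head instead of sed's q command makes the intent (only the first line) a bit more explicit.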
We can go a step further than just printing the column names. Besides the names of the columns, it would be very useful to know what type of values each column contains. Examples of data types are a string of characters, a numerical value, or a date. Assume that we have the following toy data set:
$ < data/datatypes.csv csvlook
|-----+--------+-------+----------+------------------+------------+----------|
| a | b | c | d | e | f | g |
|-----+--------+-------+----------+------------------+------------+----------|
| 2 | 0.0 | FALSE | "Yes!" | 2011-11-11 11:00 | 2012-09-08 | 12:34 |
| 42 | 3.1415 | TRUE | Oh, good | 2014-09-15 | 12/6/70 | 0:07 PM |
| 66 | | False | 2198 | | | |
|-----+--------+-------+----------+------------------+------------+----------|
We've already used csvsql in Chapter 5 to execute SQL queries directly on CSV data. When no command-line arguments are passed, it generates the SQL statement that would be needed if we were to insert this data into an actual database. We can also use this output ourselves to inspect the inferred column types:
$ csvsql data/datatypes.csv
CREATE TABLE datatypes (
a INTEGER NOT NULL,
b FLOAT,
c BOOLEAN NOT NULL,
d VARCHAR(8) NOT NULL,
e DATETIME,
f DATE,
g TIME,
CHECK (c IN (0, 1))
);
The table below provides an overview of what the various SQL data types mean. If a column has the NOT NULL string printed after the data type, then that column contains no missing values.
| Type | Python | SQL |
| --- | --- | --- |
| Character string | unicode | VARCHAR |
| Boolean | bool | BOOLEAN |
| Integer | int | INTEGER |
| Real number | float | FLOAT |
| Date | datetime.date | DATE |
| Time | datetime.time | TIME |
| Date and time | datetime.datetime | DATETIME |
7.2.4 Unique Identifiers, Continuous Variables, and Factors
Knowing the data type of each feature is not enough. It is also essential to know what each feature represents. Having knowledge about the domain is very useful here; however, we may also get some ideas from the data itself.
Both a string and an integer could be a unique identifier or could represent a category. In the latter case, the category could be used, for instance, to assign a color in a visualization. But if an integer denotes, say, a ZIP code, then it doesn't make sense to compute its average.
To determine whether a feature should be treated as a unique identifier or categorical variable (or factor in R terms), you could count the number of unique values for a specific column:
$ cat data/iris.csv | csvcut -c species | body "sort | uniq | wc -l"
species
3
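If the body tool is not at hand, the same count can be obtained with standard tools by skipping the header explicitly. A minimal sketch, using made-up data standing in for csvcut's output:

```shell
# A single-column file standing in for csvcut's output (hypothetical data)
printf 'species\nsetosa\nsetosa\nversicolor\nvirginica\n' > /tmp/species.csv

# Skip the header line, then count distinct values
tail -n +2 /tmp/species.csv | sort -u | wc -l
```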
Or we can use csvstat (Groskopf 2014a), which is part of csvkit, to get the number of unique values for each column:
$ csvstat data/investments2.csv --unique
1. company_permalink: 27342
2. company_name: 27324
3. company_category_list: 8759
4. company_market: 443
5. company_country_code: 150
6. company_state_code: 147
7. company_region: 1079
8. company_city: 3305
9. investor_permalink: 11176
10. investor_name: 11135
11. investor_category_list: 468
12. investor_market: 134
13. investor_country_code: 111
14. investor_state_code: 80
15. investor_region: 549
16. investor_city: 1198
17. funding_round_permalink: 41790
18. funding_round_type: 13
19. funding_round_code: 15
20. funded_at: 3595
21. funded_month: 295
22. funded_quarter: 121
23. funded_year: 34
24. raised_amount_usd: 6143
If the number of unique values is low compared to the number of rows, then that feature may indeed be treated as a categorical one (such as funding_round_type). If the number is equal to the number of rows, it may be a unique identifier (such as company_permalink).
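This heuristic can be expressed as the ratio of unique values to the total number of rows: a ratio close to 0 suggests a categorical feature, and a ratio close to 1 suggests a unique identifier. Below is a sketch using awk on a made-up two-column file; the file name and values are assumptions:

```shell
# Hypothetical data: column 2 has 2 unique values across 4 rows
printf 'id,type\n1,a\n2,b\n3,a\n4,a\n' > /tmp/toy.csv

# Ratio of unique values to rows for column 2
tail -n +2 /tmp/toy.csv |
  awk -F, '{ seen[$2] = 1 } END { n = 0; for (k in seen) n++; printf "%.2f\n", n / NR }'
```

Here the ratio is 0.50, so column 2 looks like a factor rather than an identifier.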