9.2 More Wine Please!
In this chapter, we’ll be using a data set of wine tastings. Specifically, red and white Portuguese “Vinho Verde” wine. Each data point represents a wine, and consists of 11 physicochemical properties: (1) fixed acidity, (2) volatile acidity, (3) citric acid, (4) residual sugar, (5) chlorides, (6) free sulfur dioxide, (7) total sulfur dioxide, (8) density, (9) pH, (10) sulphates, and (11) alcohol. There is also a quality score. This score lies between 0 (very bad) and 10 (excellent) and is the median of at least three evaluation by wine experts. More information about this data set is available at http://archive.ics.uci.edu/ml/datasets/Wine+Quality.
There are two data sets: one for white wine and one for red wine. The very first step is to obtain the two data sets using curl
(and of course parallel
because we haven’t got all day):
$ cd ~/book/ch09
$ parallel "curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases"\
> "/wine-quality/winequality-{}.csv > data/wine-{}.csv" ::: red white
The triple colon is yet another way we can pass data to parallel
. Let’s inspect both data sets using head
and count the number of rows using wc -l
:
$ head -n 5 wine-{red,white}.csv | fold
==> wine-red.csv <==
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"f
ree sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";
"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
==> wine-white.csv <==
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"f
ree sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";
"quality"
7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
$ wc -l wine-{red,white}.csv
1600 wine-red.csv
4899 wine-white.csv
6499 total
At first sight this data appears to be very clean already. Still, let’s scrub this data a little bit so that it conforms more with what most command-line tools are expecting. Specifically, we’ll:
- Convert the header to lowercase.
- Convert the semi-colons to commas.
- Convert spaces to underscores.
- Remove unnecessary quotes.These things can all be taken care of by ‘tr`. Let’s use a for loop this time—for old times’ sake—to process both data sets:
for T in red white; do
< wine-$T.csv tr '[A-Z]; ' '[a-z],_' | tr -d \" > wine-${T}-clean.csv
done
Let’s also create a data set by combining the two data sets. We’ll use csvstack
to add a column named “type” which will be “red” for rows of the first file, and “white” for rows of the second file:
$ HEADER="$(head -n 1 wine-red-clean.csv),type"
$ csvstack -g red,white -n type wine-{red,white}-clean.csv |
> csvcut -c $HEADER > wine-both-clean.csv
The new column type is added to the beginning of the table. Because some of the command-line tools that we’ll use in this chapter assume that the class label is the last column, we’ll rearrange the columns by using csvcut
. Instead of typing all 13 columns, we temporary store the desired header into a variable $HEADER before we call csvstack
.
It’s good to check whether there are any missing values in this data set:
$ csvstat wine-both-clean.csv --nulls
1. fixed_acidity: False
2. volatile_acidity: False
3. citric_acid: False
4. residual_sugar: False
5. chlorides: False
6. free_sulfur_dioxide: False
7. total_sulfur_dioxide: False
8. density: False
9. ph: False
10. sulphates: False
11. alcohol: False
12. quality: False
13. type: False
Excellent! Just out of curiosity, let’s see what the how the distribution of quality looks like for both red and white wines.
$ < wine-both-clean.csv Rio -ge 'g+geom_density(aes(quality, '\
'fill=type), adjust=3, alpha=0.5)' | display
From the density plot we can see the quality of white wine is distributed more towards higher values. Does this mean that white wines are overall better than red wines, or that the white wine experts more easily give higher scores than red wine experts? That’s something that the data doesn’t tell us. Or is there perhaps a correlation between alcohol and quality? Let’s use Rio and ggplot again to find out:
$ < wine-both-clean.csv Rio -ge 'ggplot(df, aes(x=alcohol, y=quality, '\
> 'color=type)) + geom_point(position="jitter", alpha=0.2) + '\
> 'geom_smooth(method="lm")' | display
Eureka! Ahem, let’s carry on with some modeling, shall we?