9.5 Regression with SciKit-Learn Laboratory
In this section, we’ll predict the quality of white wines based on their physicochemical properties. Because the quality is a number between 0 and 10, we can treat predicting it as a regression task. Using the same training data, we’ll train three regression models with three different algorithms.
We’ll be using the SciKit-Learn Laboratory (or SKLL) package for this. If you’re not using the Data Science Toolbox, you can install SKLL using pip:
$ pip install skll
If you’re running Python 2.7, you also need to install the following packages:
$ pip install configparser futures logutils
9.5.1 Preparing the Data
SKLL expects the training and test data to have the same filenames, located in separate directories. In this example, however, we’re going to use cross-validation, which means we only need to specify a training data set. Cross-validation is a technique that splits the whole data set into a certain number of subsets, called folds (usually five or ten). Each model is trained on all but one fold and evaluated on the held-out fold; this is repeated until every fold has served as test data once, and the scores are averaged.
We need to add an identifier to each row so that we can easily identify the data points later (the predictions are not in the same order as the original data set):
$ mkdir train
$ < wine-white-clean.csv nl -s, -w1 -v0 | sed '1s/0,/id,/' > train/features.csv
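Here, nl prepends a line number to every row (starting at 0 because of -v0, separated by a comma because of -s,), and sed renames the header’s new first column to id. To verify that this worked, you could inspect the first few columns (a quick check, assuming csvkit is installed as elsewhere in this chapter):
$ head -n 3 train/features.csv | csvcut -c 1,2,3 | csvlook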
9.5.2 Running the Experiment
Create a configuration file called predict-quality.cfg:
[General]
experiment_name = Wine
task = cross_validate
[Input]
train_location = train
featuresets = [["features.csv"]]
learners = ["LinearRegression","GradientBoostingRegressor","RandomForestRegressor"]
label_col = quality
[Tuning]
grid_search = false
feature_scaling = both
objective = r2
[Output]
log = output
results = output
predictions = output
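With grid_search set to false, each learner runs with its default hyperparameters. If you wanted SKLL to tune them instead, you could enable grid search in the [Tuning] section (a sketch; expect a considerably longer running time):
[Tuning]
grid_search = true
feature_scaling = both
objective = r2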
We run the experiment using the run_experiment command-line tool [cite:run_experiment]:
$ run_experiment -l predict-quality.cfg
The -l command-line argument indicates that we’re running in local mode. SKLL also offers the possibility to run experiments on clusters. The time it takes to run the experiment depends on the complexity of the chosen algorithms.
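If you’re curious how long it takes on your machine, you can prefix the command with the standard time utility:
$ time run_experiment -l predict-quality.cfg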
9.5.3 Parsing the Results
Once all learners have finished, the results can be found in the output directory:
$ cd output
$ ls -1
Wine_features.csv_GradientBoostingRegressor.log
Wine_features.csv_GradientBoostingRegressor.predictions
Wine_features.csv_GradientBoostingRegressor.results
Wine_features.csv_GradientBoostingRegressor.results.json
Wine_features.csv_LinearRegression.log
Wine_features.csv_LinearRegression.predictions
Wine_features.csv_LinearRegression.results
Wine_features.csv_LinearRegression.results.json
Wine_features.csv_RandomForestRegressor.log
Wine_features.csv_RandomForestRegressor.predictions
Wine_features.csv_RandomForestRegressor.results
Wine_features.csv_RandomForestRegressor.results.json
Wine_summary.tsv
SKLL generates four files for each learner: one log, two with results, and one with predictions. Moreover, SKLL generates a summary file, which contains a lot of information about each individual fold (too much to show here). We can extract the relevant metrics using the following SQL query:
$ < Wine_summary.tsv csvsql --query "SELECT learner_name, pearson FROM stdin "\
> "WHERE fold = 'average' ORDER BY pearson DESC" | csvlook
|----------------------------+----------------|
| learner_name | pearson |
|----------------------------+----------------|
| RandomForestRegressor | 0.741860521533 |
| GradientBoostingRegressor | 0.661957860603 |
| LinearRegression | 0.524144785555 |
|----------------------------+----------------|
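If you don’t have csvkit at hand, a plain awk sketch can extract the same two columns (without the sorting) by looking them up in the header; this assumes only that Wine_summary.tsv contains the learner_name, pearson, and fold columns used in the query above:
$ awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
> $col["fold"] == "average" { print $col["learner_name"], $col["pearson"] }' \
> Wine_summary.tsv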
The relevant column here is pearson, which holds the Pearson correlation coefficient: a value between -1 and 1 that indicates how strongly the predicted quality scores correlate with the true ones. Let’s paste all the predictions back into the data set:
$ parallel "csvjoin -c id train/features.csv <(< output/Wine_features.csv_{}"\
> ".predictions | tr '\t' ',') | csvcut -c id,quality,prediction > {}" ::: \
> RandomForestRegressor GradientBoostingRegressor LinearRegression
$ csvstack *Regres* -n learner --filenames > predictions.csv
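As a sanity check, you can recompute the correlation for one learner from its joined predictions file using Rio. The result should be close to, though not identical to, the value in the summary, because the summary averages the per-fold correlations:
$ < RandomForestRegressor Rio -e 'cor(df$quality, df$prediction)'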
Finally, let’s create a plot using Rio:
$ < predictions.csv Rio -ge 'g+geom_point(aes(quality, round(prediction), '\
> 'color=learner), position="jitter", alpha=0.1) + facet_wrap(~ learner) + '\
> 'theme(aspect.ratio=1) + xlim(3,9) + ylim(3,9) + guides(colour=FALSE) + '\
> 'geom_smooth(aes(quality, prediction), method="lm", color="black") + '\
> 'ylab("prediction")' | display
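Because Rio writes the resulting image to standard output, you can also save the plot to a file instead of displaying it by replacing the final display with a redirect, for example (a simplified sketch):
$ < predictions.csv Rio -ge 'g + geom_point(aes(quality, prediction))' > predictions.png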