9.6 Classification with BigML
In this fourth and last modeling section, we're going to classify wines as either red or white. For this we'll be using a solution called BigML, which provides a prediction API. This means that the actual modeling and predicting takes place in the cloud, which is useful if you need a bit more power than your own computer can offer.
Although prediction APIs are relatively young, they are gaining traction, which is why we've included one in this chapter. Other providers of prediction APIs are Google (see https://developers.google.com/prediction) and PredictionIO (see http://prediction.io). One advantage of BigML is that they offer a convenient command-line tool called bigmler (BigML 2014) that interfaces with their API. We can use this command-line tool like any other presented in this book, but behind the scenes, our data set is being sent to BigML's servers, which perform the classification and send back the results.
9.6.1 Creating Balanced Train and Test Data Sets
First, we create a balanced data set to ensure that both classes are represented equally. For this, we use csvstack (Groskopf 2014h), shuf (Eggert 2012), head, and csvcut:
$ csvstack -n type -g red,white wine-red-clean.csv \
> <(< wine-white-clean.csv body shuf | head -n 1600) |
> csvcut -c fixed_acidity,volatile_acidity,citric_acid,\
> residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,\
> density,ph,sulphates,alcohol,type > wine-balanced.csv
This long command breaks down as follows:
- csvstack is used to combine multiple data sets. It creates a new column type, which has the value red for all rows coming from the first file wine-red-clean.csv and the value white for all rows coming from the second file.
- The second file is passed to csvstack using file redirection. This allows us to create a temporary file using shuf, which creates a random permutation of wine-white-clean.csv, and head, which selects only the header and the first 1599 rows.
- Finally, we reorder the columns of this data set using csvcut because, by default, bigmler assumes that the last column is the label.

Let's verify that wine-balanced.csv is actually balanced by counting the number of instances per class using parallel and grep:
$ parallel --tag grep -c {} wine-balanced.csv ::: red white
red 1599
white 1599
As you can see, the data set wine-balanced.csv contains 1599 red and 1599 white wines. Next, we split it into train and test data sets using split (Granlund and Stallman 2012b):
$ < wine-balanced.csv header > wine-header.csv
$ tail -n +2 wine-balanced.csv | shuf | split -d -n r/2
$ parallel --xapply "cat wine-header.csv x0{1} > wine-{2}.csv" \
> ::: 0 1 ::: train test
These three commands break down as follows:
- Get the header using header and save it to a temporary file named wine-header.csv.
- Mix up the red and white wines using tail and shuf, and split the result into two files named x00 and x01 using a round-robin distribution.
- Use cat to combine the header saved in wine-header.csv with the rows stored in x00, and save it as wine-train.csv; similarly for x01 and wine-test.csv. The --xapply command-line argument tells parallel to loop over the two input sources in tandem.

Let's check again the number of instances per class in both wine-train.csv and wine-test.csv:
$ parallel --tag grep -c {2} wine-{1}.csv ::: train test ::: red white
train red 821
train white 778
test white 821
test red 778
That looks like our data sets are well balanced.
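As a final sanity check, we can count the total number of lines per file using wc; each file should contain 1600 lines (one header plus 1599 rows):

$ wc -l wine-train.csv wine-test.csv

We're now ready to call the prediction API using bigmler.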
9.6.2 Calling the API
You can obtain a BigML username and API key at https://bigml.com/developers. Be sure to set the variables BIGML_USERNAME and BIGML_API_KEY in .bashrc with the appropriate values.
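For example, the relevant lines in your .bashrc might look like this (the values below are placeholders; substitute your own credentials):

export BIGML_USERNAME=my_username
export BIGML_API_KEY=my_api_key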
The API call is quite straightforward, and the meaning of each command-line argument is obvious from its name:
$ bigmler --train data/wine-train.csv \
> --test data/wine-test-blind.csv \
> --prediction-info full \
> --prediction-header \
> --output-dir output \
> --tag wine \
> --remote
The file wine-test-blind.csv is simply wine-test.csv with the type column (that is, the label) removed.
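Such a blind file can be created with csvcut's -C option, which excludes the named columns (a minimal sketch, assuming wine-test.csv lives in the data directory):

$ csvcut -C type data/wine-test.csv > data/wine-test-blind.csv

After the bigmler call is finished, the results can be found in the output directory: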
$ tree output
output
├── batch_prediction
├── bigmler_sessions
├── dataset
├── dataset_test
├── models
├── predictions.csv
├── source
└── source_test
0 directories, 8 files
9.6.3 Inspecting the Results
The file of most interest is output/predictions.csv:
$ csvcut -c type output/predictions.csv | head
type
white
white
red
red
white
red
red
white
red
We can compare these predicted labels with the labels in our test data set. Let’s count the number of misclassifications:
$ paste -d, <(csvcut -c type data/wine-test.csv) \
> <(csvcut -c type output/predictions.csv) |
> awk -F, '{ if ($1 != $2) {sum+=1 } } END { print sum }'
766
- First, we combine the type columns of both data/wine-test.csv and output/predictions.csv.
- Then, we use awk to count the rows where the two columns differ in value.

As you can see, BigML's API misclassified 766 wines out of 1599. This isn't a good result, but please note that we just blindly applied an algorithm to a data set, which we normally wouldn't do.
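If you would rather see the misclassification rate than the raw count, a small variation of the same awk program computes it (a sketch; the NR - 1 accounts for the header row):

$ paste -d, <(csvcut -c type data/wine-test.csv) \
> <(csvcut -c type output/predictions.csv) |
> awk -F, 'NR > 1 && $1 != $2 { sum += 1 } END { print sum / (NR - 1) }'

Given the counts above, this should print roughly 0.48 (766 divided by 1599).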
9.6.4 Conclusion
BigML’s prediction API has proven to be easy to use. As with many of the command-line tools discussed in this book, we’ve barely scratched the surface with BigML. For completeness, we should mention that:
- BigML's command-line tool also allows for local computations, which is useful for debugging (see the sketch after this list).
- Results can also be inspected using BigML's web interface.
- BigML can also perform regression tasks.
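As an example of the first point, predictions can be computed locally by simply omitting the --remote flag (a sketch relying on bigmler's default behavior; the model itself is still trained on BigML's servers, and output-local is a hypothetical output directory):

$ bigmler --train data/wine-train.csv \
> --test data/wine-test-blind.csv \
> --prediction-info full \
> --prediction-header \
> --output-dir output-local \
> --tag wine

Please see https://bigml.com/developers for a complete overview of BigML's features.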
Although we've only been able to experiment with one prediction API, we do believe that prediction APIs in general are worth considering for doing data science.