9.6 Classification with BigML
In this fourth and last modeling section, we're going to classify wines as either red or white. For this we'll be using a solution called BigML, which provides a prediction API. This means that the actual modeling and predicting takes place in the cloud, which is useful if you need a bit more power than your own computer can offer.
Although prediction APIs are relatively young, they are gaining traction, which is why we've included one in this chapter. Other providers of prediction APIs are Google (see https://developers.google.com/prediction) and PredictionIO (see http://prediction.io). One advantage of BigML is that they offer a convenient command-line tool called bigmler (BigML 2014) that interfaces with their API. We can use this command-line tool like any other presented in this book, but behind the scenes, our data set is being sent to BigML's servers, which perform the classification and send back the results.
9.6.1 Creating Balanced Train and Test Data Sets
First, we create a balanced data set to ensure that both classes are represented equally. For this, we use csvstack (Groskopf 2014h), shuf (Eggert 2012), head, and csvcut:
$ csvstack -n type -g red,white wine-red-clean.csv \
> <(< wine-white-clean.csv body shuf | head -n 1600) |
> csvcut -c fixed_acidity,volatile_acidity,citric_acid,\
> residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,\
> density,ph,sulphates,alcohol,type > wine-balanced.csv
This long command breaks down as follows:
- csvstack is used to combine multiple data sets. It creates a new column type, which has the value red for all rows coming from the first file wine-red-clean.csv and the value white for all rows coming from the second file.
- The second file is passed to csvstack using file redirection. This allows us to create a temporary file using shuf, which creates a random permutation of wine-white-clean.csv, and head, which selects only the header and the first 1599 rows.
- Finally, we reorder the columns of this data set using csvcut because, by default, bigmler assumes that the last column is the label.

Let's verify that wine-balanced.csv is actually balanced by counting the number of instances per class using parallel and grep:
$ parallel --tag grep -c {} wine-balanced.csv ::: red white
red 1599
white 1599
As you can see, the data set wine-balanced.csv contains 1599 red and 1599 white wines. Next, we split it into train and test data sets using split (Granlund and Stallman 2012b):
$ < wine-balanced.csv header > wine-header.csv
$ tail -n +2 wine-balanced.csv | shuf | split -d -n r/2
$ parallel --xapply "cat wine-header.csv x0{1} > wine-{2}.csv" \
> ::: 0 1 ::: train test
These three commands break down as follows:
- Get the header using header and save it to a temporary file named wine-header.csv.
- Mix up the red and white wines using tail and shuf, and split the result into two files named x00 and x01 using a round-robin distribution.
- Use cat to combine the header saved in wine-header.csv with the rows stored in x00, and save it as wine-train.csv; similarly for x01 and wine-test.csv. The --xapply command-line argument tells parallel to loop over the two input sources in tandem.

Let's check again the number of instances per class in both wine-train.csv and wine-test.csv:
$ parallel --tag grep -c {2} wine-{1}.csv ::: train test ::: red white
train red 821
train white 778
test white 821
test red 778
That looks like our data sets are well balanced.
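As a final sanity check, we can count the total number of lines per file using wc; each file should contain 1600 lines (one header plus 1599 rows):

$ wc -l wine-train.csv wine-test.csv

We're now ready to call the prediction API using bigmler.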
9.6.2 Calling the API
You can obtain a BigML username and API key at https://bigml.com/developers. Be sure to set the variables BIGML_USERNAME and BIGML_API_KEY in .bashrc with the appropriate values.
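For example, the relevant lines in your .bashrc might look like this (the values below are placeholders; substitute your own credentials):

export BIGML_USERNAME=my_username
export BIGML_API_KEY=my_api_key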
The API call is quite straightforward, and the meaning of each command-line argument is obvious from its name:
$ bigmler --train data/wine-train.csv \
> --test data/wine-test-blind.csv \
> --prediction-info full \
> --prediction-header \
> --output-dir output \
> --tag wine \
> --remote
The file wine-test-blind.csv is simply wine-test.csv with the type column (that is, the label) removed.
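Such a blind file can be created with csvcut's -C option, which excludes the named columns (a minimal sketch, assuming wine-test.csv lives in the data directory):

$ csvcut -C type data/wine-test.csv > data/wine-test-blind.csv

After the bigmler call is finished, the results can be found in the output directory: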
$ tree output
output
├── batch_prediction
├── bigmler_sessions
├── dataset
├── dataset_test
├── models
├── predictions.csv
├── source
└── source_test
0 directories, 8 files
9.6.3 Inspecting the Results
The file of most interest is output/predictions.csv:
$ csvcut -c type output/predictions.csv | head
type
white
white
red
red
white
red
red
white
red
We can compare these predicted labels with the labels in our test data set. Let’s count the number of misclassifications:
$ paste -d, <(csvcut -c type data/wine-test.csv) \
> <(csvcut -c type output/predictions.csv) |
> awk -F, '{ if ($1 != $2) {sum+=1 } } END { print sum }'
766
- First, we combine the type columns of both data/wine-test.csv and output/predictions.csv.
- Then, we use awk to count the rows where the two columns differ in value.

As you can see, BigML's API misclassified 766 wines out of 1599. This isn't a good result, but please note that we just blindly applied an algorithm to a data set, which we normally wouldn't do.
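If you would rather see the misclassification rate than the raw count, a small variation of the same awk program computes it (a sketch; the NR - 1 accounts for the header row):

$ paste -d, <(csvcut -c type data/wine-test.csv) \
> <(csvcut -c type output/predictions.csv) |
> awk -F, 'NR > 1 && $1 != $2 { sum += 1 } END { print sum / (NR - 1) }'

Given the counts above, this should print roughly 0.48 (766 divided by 1599).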
9.6.4 Conclusion
BigML’s prediction API has proven to be easy to use. As with many of the command-line tools discussed in this book, we’ve barely scratched the surface with BigML. For completeness, we should mention that:
- BigML's command-line tool also allows for local computations, which is useful for debugging (see the sketch after this list).
- Results can also be inspected using BigML's web interface.
- BigML can also perform regression tasks.
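As an example of the first point, predictions can be computed locally by simply omitting the --remote flag (a sketch relying on bigmler's default behavior; the model itself is still trained on BigML's servers, and output-local is a hypothetical output directory):

$ bigmler --train data/wine-train.csv \
> --test data/wine-test-blind.csv \
> --prediction-info full \
> --prediction-header \
> --output-dir output-local \
> --tag wine

Please see https://bigml.com/developers for a complete overview of BigML's features.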
Although we've only been able to experiment with one prediction API, we do believe that prediction APIs in general are worth considering for doing data science.