9.4 Clustering with Weka
In this section we’ll cluster our wine data set into groups. Like dimensionality reduction, clustering is usually unsupervised. It can be used to gain an understanding of how your data is structured. Once the data has been clustered, you can visualize the result by coloring the data points according to their cluster assignment. For most algorithms you specify upfront how many groups you want the data to be clustered into; some algorithms are able to determine a suitable number of groups themselves.
For this task we’ll use Weka, which is maintained by the Machine Learning Group at the University of Waikato (Hall et al. 2009). If you already know Weka, then you probably know it as software with a graphical user interface. However, as you’ll see, Weka can also be used from the command line (albeit with some modifications). Besides clustering, Weka can also do classification and regression, but we’re going to use other tools for those machine learning tasks.
9.4.1 Introducing Weka
You may ask: surely there are better command-line tools for clustering? And you would be right. One reason we include Weka in this chapter is to show you how you can work around such imperfections by building additional command-line tools. As you spend more time on the command line and try out other command-line tools, chances are that you’ll come across one that seems very promising at first but does not work as you expected. A common imperfection is that a command-line tool does not handle standard input or standard output correctly. In the next sections we’ll point out these imperfections and demonstrate how to work around them.
9.4.2 Taming Weka on the Command Line
Weka can be invoked from the command line, but it’s definitely not straightforward or user friendly. Weka is programmed in Java, which means that you have to run java, specify the location of the weka.jar file, and specify the individual class you want to call. For example, Weka has a class called MexicanHat, which generates a toy data set. To generate 10 data points using this class, you would run:
$ java -cp ~/bin/weka.jar weka.datagenerators.classifiers.regression.MexicanHat\
> -n 10 | fold
%
% Commandline
%
% weka.datagenerators.classifiers.regression.MexicanHat -r weka.datagenerators.c
lassifiers.regression.MexicanHat-S_1_-n_10_-A_1.0_-R_-10..10_-N_0.0_-V_1.0 -S 1
-n 10 -A 1.0 -R -10..10 -N 0.0 -V 1.0
%
@relation weka.datagenerators.classifiers.regression.MexicanHat-S_1_-n_10_-A_1.0
_-R_-10..10_-N_0.0_-V_1.0
@attribute x numeric
@attribute y numeric
@data
4.617564,-0.215591
-1.798384,0.541716
-5.845703,-0.072474
-3.345659,-0.060572
9.355118,0.00744
-9.877656,-0.044298
9.274096,0.016186
8.797308,0.066736
8.943898,0.051718
8.741643,0.072209
Don’t worry about the output of this command; we’ll discuss it later. At this moment, we’re concerned with the usage of Weka. There are a couple of things to note here:
- You need to run java, which is counterintuitive.
- The jar file contains over 2000 classes, and only about 300 of those can be used from the command line directly. How do you know which ones?
- You need to specify the entire namespace of the class: weka.datagenerators.classifiers.regression.MexicanHat. How are you supposed to remember that?

Does this mean that we’re going to give up on Weka? Of course not! Since Weka does contain a lot of useful functionality, we’re going to tackle these issues in the next three subsections.
9.4.2.1 An Improved Command-line Tool for Weka
Save the following snippet as a new file called weka and put it somewhere on your PATH:
#!/usr/bin/env bash
java -Xmx1024M -cp ${WEKAPATH}/weka.jar "weka.$@"
Subsequently, add the following line to your .bashrc file so that weka knows where to find weka.jar:
$ export WEKAPATH=/home/vagrant/repos/weka
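One subtlety in the wrapper is the quoting of "weka.$@": within double quotes, Bash keeps each argument as a separate word, and the weka. prefix attaches only to the first one. A minimal stand-in (no Weka required, with a hypothetical show function) that prints each word the wrapper would pass to java:

```shell
# Stand-in for the wrapper's argument handling: brackets delimit each
# word, showing that "weka.$@" prefixes only the first argument.
show() { printf '[%s]' "weka.$@"; }
result=$(show clusterers.EM -N 5)
echo "$result"
# [weka.clusterers.EM][-N][5]
```

This is why calling weka clusterers.EM -N 5 ends up invoking the class weka.clusterers.EM with its options intact.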
We can now call the previous example with:
$ weka datagenerators.classifiers.regression.MexicanHat -n 10
9.4.2.2 Usable Weka Classes
As mentioned, the file weka.jar contains over 2000 classes. Many of them cannot be used from the command line directly. We consider a class usable from the command line when it provides us with a help message if we invoke it with -h. For example:
$ weka datagenerators.classifiers.regression.MexicanHat -h
Data Generator options:
-h
Prints this help.
-o <file>
The name of the output file, otherwise the generated data is
printed to stdout.
-r <name>
The name of the relation.
-d
Whether to print debug informations.
-S
The seed for random function (default 1)
-n <num>
The number of examples to generate (default 100)
-A <num>
The amplitude multiplier (default 1.0).
-R <num>..<num>
The range x is randomly drawn from (default -10.0..10.0).
-N <num>
The noise rate (default 0.0).
-V <num>
The noise variance (default 1.0).
Now that’s usable. This, for example, is not a usable class:
$ weka filters.SimpleFilter -h
java.lang.ClassNotFoundException: -h
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:171)
at weka.filters.Filter.main(Filter.java:1344)
-h
The following pipeline runs weka with every class in weka.jar and -h, and saves the standard output and standard error to a file with the same name as the class:
$ unzip -l $WEKAPATH/weka.jar |
> sed -rne 's/.*(weka)\/([^g])([^$]*)\.class$/\2\3/p' |
> tr '/' '.' |
> parallel --timeout 1 -j4 -v "weka {} -h > {} 2>&1"
We now have 749 files. With the following command we save the filename of every file that does not contain the string Exception to weka.classes:
$ grep -L 'Exception' * | tee $WEKAPATH/weka.classes
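The flag doing the work here is -L, which inverts grep’s usual behavior: it prints the names of files that do not contain the pattern. A toy illustration with two made-up files in a scratch directory (the file names and contents are assumptions mimicking the Weka output above):

```shell
# grep -L lists files WITHOUT a match; files whose help output contained
# an exception are filtered out, leaving only the usable classes.
dir=$(mktemp -d)
printf 'java.lang.ClassNotFoundException\n' > "$dir/filters.SimpleFilter"
printf 'Data Generator options:\n'          > "$dir/datagenerators.MexicanHat"
usable=$(cd "$dir" && grep -L 'Exception' *)
echo "$usable"
# datagenerators.MexicanHat
rm -rf "$dir"
```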
This still comes down to 332 classes! Here are a few classes that might be of interest:
attributeSelection.PrincipalComponents
classifiers.bayes.NaiveBayes
classifiers.evaluation.ConfusionMatrix
classifiers.functions.SimpleLinearRegression
classifiers.meta.AdaBoostM1
classifiers.trees.RandomForest
clusterers.EM
filters.unsupervised.attribute.Normalize
As you can see, weka offers a whole range of classes and functionality.
9.4.2.3 Adding Tab Completion
At this moment, you still need to type in the entire class name yourself. You can add so-called tab completion by adding the following snippet to your .bashrc file after you export WEKAPATH:
_completeweka() {
local curw=${COMP_WORDS[COMP_CWORD]}
local wordlist=$(cat $WEKAPATH/weka.classes)
COMPREPLY=($(compgen -W '${wordlist[@]}' -- "$curw"))
return 0
}
complete -o nospace -F _completeweka weka
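The filtering itself is done by compgen -W, a Bash builtin that narrows a word list down to the entries matching the current prefix. A small sketch with a hand-picked word list (the class names are just examples):

```shell
# compgen -W splits the word list and keeps only entries starting
# with the given prefix -- the core of the _completeweka function.
wordlist='clusterers.EM clusterers.SimpleKMeans filters.SimpleFilter'
matches=$(compgen -W "$wordlist" -- 'clu')
echo "$matches"
```

Note that compgen is a Bash builtin, so this snippet assumes Bash rather than a plain POSIX shell.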
This function makes use of the weka.classes file we generated earlier. If you now type weka clu<Tab><Tab><Tab> on the command line, you are presented with a list of all classes that have to do with clustering:
$ weka clusterers.
clusterers.CheckClusterer
clusterers.CLOPE
clusterers.ClusterEvaluation
clusterers.Cobweb
clusterers.DBSCAN
clusterers.EM
clusterers.FarthestFirst
clusterers.FilteredClusterer
clusterers.forOPTICSAndDBScan.OPTICS_GUI.OPTICS_Visualizer
clusterers.HierarchicalClusterer
clusterers.MakeDensityBasedClusterer
clusterers.OPTICS
clusterers.sIB
clusterers.SimpleKMeans
clusterers.XMeans
Creating a command-line tool weka and adding tab completion makes Weka a little bit more friendly to use on the command line.
9.4.3 Converting between CSV and ARFF Data Formats
Weka uses ARFF as a file format. This is basically CSV with additional information about the columns. We’ll use two convenient command-line tools to convert between CSV and ARFF, namely csv2arff (see Example 9.1) and arff2csv (see Example 9.2).
Example 9.1 (Convert CSV to ARFF)
#!/usr/bin/env bash
weka core.converters.CSVLoader /dev/stdin
Example 9.2 (Convert ARFF to CSV)
#!/usr/bin/env bash
weka core.converters.CSVSaver -i /dev/stdin
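To see what the “additional information about the columns” looks like, here is a hand-written sketch of the ARFF that corresponds to a tiny two-column numeric CSV (the relation name and attribute declarations are illustrative, not actual CSVLoader output):

```shell
# Minimal ARFF: a header declaring each column's type, then the data
# rows, which are plain CSV without a header line.
arff=$(cat <<'EOF'
@relation tiny
@attribute x numeric
@attribute y numeric
@data
1.0,2.0
3.0,4.0
EOF
)
echo "$arff"
```

Compare this with the MexicanHat output earlier in the section, which follows the same @relation/@attribute/@data layout.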
9.4.4 Comparing Three Cluster Algorithms
Unfortunately, in order to cluster data using Weka, we need yet another command-line tool to help us. The AddCluster class is needed to assign data points to the learned clusters. This class does not accept data from standard input, not even when we specify -i /dev/stdin, because it expects a file with the .arff extension. We consider this to be bad design. The source code of weka-cluster is:
#!/usr/bin/env bash
ALGO="$@"
IN=$(mktemp --tmpdir weka-cluster-XXXXXXXX).arff
finish () {
rm -f $IN
}
trap finish EXIT
csv2arff > $IN
weka filters.unsupervised.attribute.AddCluster -W "weka.${ALGO}" -i $IN \
-o /dev/stdout | arff2csv
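The mktemp/trap combination in weka-cluster is a pattern worth remembering: the EXIT trap removes the temporary file no matter how the script ends. A standalone sketch (no Weka needed) that demonstrates the cleanup even runs when the script fails partway:

```shell
# Write a small script that creates a temp file, registers an EXIT
# trap, then exits with an error; the trap still removes the file.
tmpscript=$(mktemp)
cat > "$tmpscript" <<'EOF'
#!/usr/bin/env bash
IN=$(mktemp)
finish () { rm -f "$IN"; }
trap finish EXIT
echo "$IN"   # report the temp file so the caller can check it
exit 1       # simulate a failure; finish runs anyway
EOF
tmpfile=$(bash "$tmpscript")
rm -f "$tmpscript"
```

After running this, the file reported in tmpfile no longer exists, which is exactly why weka-cluster never leaves stray .arff files behind.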
Now we can apply the EM clustering algorithm and save the assignment as follows:
$ cd data
$ < wine-both-scaled.csv csvcut -C quality,type |
> weka-cluster clusterers.EM -N 5 |
> csvcut -c cluster > wine-both-cluster-em.csv
- Use the scaled features, and don’t use the features quality and type for the clustering.
- Apply the algorithm using weka-cluster.
- Only save the cluster assignment.

We’ll run the same command again for the SimpleKMeans and Cobweb algorithms. Now we have three files with cluster assignments. Let’s create a t-SNE mapping in order to visualize the cluster assignments:
$ < wine-both-scaled.csv csvcut -C quality,type | body tapkee --method t-sne |
> header -r x,y > wine-both-xy.csv
Next, the cluster assignments are combined with the t-SNE mapping using paste and a scatter plot is created using Rio-scatter:
$ parallel -j1 "paste -d, wine-both-xy.csv wine-both-cluster-{}.csv | "\
> "Rio-scatter x y cluster | display" ::: em simplekmeans cobweb
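The paste -d, step simply glues files together column-wise. A toy illustration with two tiny hand-made files (the values are made up) shows how the x,y coordinates and the cluster column end up side by side:

```shell
# paste -d, joins corresponding lines of the two files with a comma,
# producing one CSV with the coordinates and the cluster assignment.
xy=$(mktemp); cl=$(mktemp)
printf 'x,y\n1.2,3.4\n'      > "$xy"
printf 'cluster\ncluster1\n' > "$cl"
combined=$(paste -d, "$xy" "$cl")
echo "$combined"
# x,y,cluster
# 1.2,3.4,cluster1
rm -f "$xy" "$cl"
```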
Figure 9.3: EM
Figure 9.4: SimpleKMeans
Figure 9.5: Cobweb
Admittedly, we have gone through a lot of trouble taming Weka. The exercise was worth it, because some day you may run into a command-line tool that works differently from what you expect. Now you know that there are always ways to work around such command-line tools.