9.3 Dimensionality Reduction with Tapkee
The goal of dimensionality reduction is to map high-dimensional data points onto a lower-dimensional space. The challenge is to keep similar data points close together in that lower-dimensional mapping. As we’ve seen in the previous section, our wine data set contains 13 features. We’ll reduce these to two dimensions because that’s straightforward to visualize.
Dimensionality reduction is often regarded as part of the exploring step. It’s useful when there are too many features to plot. You could create a scatter-plot matrix, but each panel shows only two features at a time. It’s also useful as a pre-processing step for other machine learning algorithms.
Most dimensionality reduction algorithms are unsupervised. This means that they don’t employ the labels of the data points in order to construct the lower-dimensional mapping.
In this section we’ll look at two techniques: PCA, which stands for Principal Components Analysis (Pearson 1901), and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (Maaten and Hinton 2008).
9.3.1 Introducing Tapkee
Tapkee is a C++ template library for dimensionality reduction (Lisitsyn, Widmer, and Garcia 2013). The library contains implementations of many dimensionality reduction algorithms, including:
- Locally Linear Embedding
- Isomap
- Multidimensional scaling
- PCA
- t-SNE
Tapkee’s website (http://tapkee.lisitsyn.me/) contains more information about these algorithms. Although Tapkee is mainly a library that can be included in other applications, it also offers a command-line tool. We’ll use this to perform dimensionality reduction on our wine data set.
9.3.2 Installing Tapkee
If you aren’t running the Data Science Toolbox, you’ll need to download and compile Tapkee yourself. First make sure that you have CMake installed. On Ubuntu, you simply run:
$ sudo apt-get install cmake
Please consult Tapkee’s website for instructions for other operating systems. Then execute the following commands to download the source and compile it:
$ curl -sL https://github.com/lisitsyn/tapkee/archive/master.tar.gz > \
> tapkee-master.tar.gz
$ tar -xzf tapkee-master.tar.gz
$ cd tapkee-master
$ mkdir build && cd build
$ cmake ..
$ make
This creates a binary executable named tapkee.
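Where the binary ends up depends on the build configuration, so it can help to locate it before using it. The following is a minimal sketch; find_tapkee is a hypothetical helper name, not something Tapkee provides, and it only assumes that the build produced an executable file called tapkee somewhere under the directory you pass in.

```shell
# Hypothetical helper (not part of Tapkee): print the first executable file
# named "tapkee" found under the directory given as $1 (default: current dir).
# Prints nothing if no such file exists.
find_tapkee() {
  find "${1:-.}" -type f -name tapkee -perm -u+x 2>/dev/null | head -n 1
}

# Example usage, assuming the source was unpacked into ./tapkee-master:
#   tapkee_bin=$(find_tapkee tapkee-master)
#   [ -n "$tapkee_bin" ] && export PATH="$(dirname "$tapkee_bin"):$PATH"
```

Adding the containing directory to PATH, as in the commented usage, is one way to make the plain command name tapkee available to the pipelines later in this section.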
9.3.3 Linear and Non-linear Mappings
First, we’ll scale the features using standardization such that each feature is equally important. This generally leads to better results when applying machine learning algorithms.
To scale we use a combination of cols and Rio:
$ < wine-both.csv cols -C type Rio -f scale > wine-both-scaled.csv
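To make the scaling step concrete, here is a rough sketch of standardization written with awk alone. It assumes that the scale function applied through Rio is R’s scale(), which subtracts each column’s mean and divides by its sample standard deviation (n - 1). The function name standardize_csv is hypothetical, and the sketch assumes a headered CSV without quoted commas, at least two rows, and no constant columns.

```shell
# Hypothetical helper: z-score every column of a headered CSV read from stdin,
# leaving the column named in $1 untouched (e.g. the "type" label).
standardize_csv() {
  awk -F, -v OFS=, -v keep="$1" '
    NR == 1 {                       # remember column names, pass header through
      ncol = NF
      for (j = 1; j <= NF; j++) name[j] = $j
      print
      next
    }
    {                               # store every cell and accumulate column sums
      nrow++
      for (j = 1; j <= ncol; j++) {
        cell[nrow, j] = $j
        if (name[j] != keep) sum[j] += $j
      }
    }
    END {
      for (j = 1; j <= ncol; j++) {
        if (name[j] == keep) continue
        mean[j] = sum[j] / nrow
        ss = 0
        for (i = 1; i <= nrow; i++) ss += (cell[i, j] - mean[j]) ^ 2
        sd[j] = sqrt(ss / (nrow - 1))   # sample standard deviation
      }
      for (i = 1; i <= nrow; i++) {
        for (j = 1; j <= ncol; j++) {
          v = cell[i, j]
          if (name[j] != keep) v = sprintf("%.4f", (v - mean[j]) / sd[j])
          printf "%s%s", v, (j < ncol ? OFS : ORS)
        }
      }
    }'
}

# Example usage (rounds to four decimals, unlike the Rio-based command):
#   < wine-both.csv standardize_csv type > wine-both-scaled.csv
```

After this transformation every scaled column has mean 0 and standard deviation 1, so no single feature dominates the distance calculations merely because of its units.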
Now we apply both dimensionality reduction techniques and visualize the mapping using Rio-scatter:
$ < wine-both-scaled.csv cols -C type body tapkee --method pca |
> header -r x,y,type | Rio-scatter x y type |
> tee tapkee-wine-pca.png | display
Figure 9.1: PCA
$ < wine-both-scaled.csv cols -C type body tapkee --method t-sne |
> header -r x,y,type | Rio-scatter x y type |
> tee tapkee-wine-t-sne.png | display
Figure 9.2: t-SNE
Note that, apart from tee, there’s not a single GNU core util (i.e., classic command-line tool) in either one-liner. Now that’s the power of the command line!