6.8. Pairwise metrics, Affinities and Kernels

The sklearn.metrics.pairwise submodule implements utilities to evaluate pairwise distances or affinity of sets of samples.

This module contains both distance metrics and kernels. A brief summary of the two is given here.

Distance metrics are functions d(a, b) such that d(a, b) < d(a, c) if objects a and b are considered “more similar” than objects a and c. Two objects exactly alike would have a distance of zero. One of the most popular examples is Euclidean distance. To be a ‘true’ metric, it must obey the following four conditions:

1. d(a, b) >= 0, for all a and b
2. d(a, b) == 0, if and only if a = b, positive definiteness
3. d(a, b) == d(b, a), symmetry
4. d(a, c) <= d(a, b) + d(b, c), the triangle inequality
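
As a rough illustration, the four conditions can be checked numerically on a small distance matrix. The sketch below assumes random sample data (the array `X`, the seed and the tolerance are illustrative choices) and uses Euclidean distance, which is a true metric:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Illustrative data: 5 random points in 3 dimensions (an assumption).
rng = np.random.RandomState(0)
X = rng.rand(5, 3)

D = pairwise_distances(X, metric="euclidean")
off_diagonal = ~np.eye(len(X), dtype=bool)

# 1. Non-negativity.
non_negative = bool(np.all(D >= 0))
# 2. Zero distance only between identical points (the points here are distinct).
identity = bool(np.allclose(np.diag(D), 0) and np.all(D[off_diagonal] > 0))
# 3. Symmetry.
symmetric = bool(np.allclose(D, D.T))
# 4. Triangle inequality, D[a, c] <= D[a, b] + D[b, c], via broadcasting
#    over the index triple (a, b, c); a small tolerance absorbs rounding.
triangle = bool(np.all(D[:, None, :] <= D[:, :, None] + D[None, :, :] + 1e-12))

print(non_negative, identity, symmetric, triangle)
```

A check like this cannot prove that a function is a metric, but a single failing triple is enough to show that it is not.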

Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered “more similar” than objects a and c. A kernel must also be positive semi-definite.
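
Positive semi-definiteness can be inspected on a concrete kernel matrix: it should be symmetric with no negative eigenvalues, up to rounding. The following is a minimal sketch on assumed random data, using the RBF kernel from this module as the example:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Illustrative data: 6 random points in 3 dimensions (an assumption).
rng = np.random.RandomState(0)
X = rng.rand(6, 3)

K = rbf_kernel(X)  # kernel (Gram) matrix of X with itself

is_symmetric = bool(np.allclose(K, K.T))
# eigvalsh is meant for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(K)
# Allow a tiny negative tolerance for floating-point error.
is_psd = bool(np.all(eigenvalues >= -1e-10))

print(is_symmetric, is_psd)
```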

There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be the distance, and S be the kernel:

1. S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 / num_features
2. S = 1. / (D / np.max(D))
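
The first conversion can be sketched as follows. The data and the choice of Manhattan distance are assumptions made for illustration; gamma follows the 1 / num_features heuristic mentioned above:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[2, 3], [3, 5], [5, 8]])

# Distance matrix D between the rows of X.
D = pairwise_distances(X, metric="manhattan")

# Heuristic from the text: gamma = 1 / num_features.
gamma = 1.0 / X.shape[1]

# Exponential conversion: zero distance maps to similarity 1,
# and larger distances map to similarities closer to 0.
S = np.exp(-D * gamma)
print(S)
```

Note that the second conversion divides by the entries of D, so it produces infinite values wherever a distance is exactly zero, for example on the diagonal when X is compared with itself.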

The distances between the row vectors of X and the row vectors of Y can be evaluated using pairwise_distances. If Y is omitted the pairwise distances of the row vectors of X are calculated. Similarly, pairwise.pairwise_kernels can be used to calculate the kernel between X and Y using different kernel functions. See the API reference for more details.

>>> import numpy as np
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = np.array([[2, 3], [3, 5], [5, 8]])
>>> Y = np.array([[1, 0], [2, 1]])
>>> pairwise_distances(X, Y, metric='manhattan')
array([[ 4.,  2.],
       [ 7.,  5.],
       [12., 10.]])
>>> pairwise_distances(X, metric='manhattan')
array([[0., 3., 8.],
       [3., 0., 5.],
       [8., 5., 0.]])
>>> pairwise_kernels(X, Y, metric='linear')
array([[ 2.,  7.],
       [ 3., 11.],
       [ 5., 18.]])

6.8.1. Cosine similarity

cosine_similarity computes the L2-normalized dot product of vectors. That is, if $x$ and $y$ are row vectors, their cosine similarity $k$ is defined as:

$$k(x, y) = \frac{x y^\top}{\|x\| \|y\|}$$

This is called cosine similarity, because Euclidean (L2) normalization projects the vectors onto the unit sphere, and their dot product is then the cosine of the angle between the points denoted by the vectors.

This kernel is a popular choice for computing the similarity of documents represented as tf-idf vectors. cosine_similarity accepts scipy.sparse matrices. (Note that the tf-idf functionality in sklearn.feature_extraction.text can produce normalized vectors, in which case cosine_similarity is equivalent to linear_kernel, only slower.)
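
The equivalence with linear_kernel on normalized vectors can be illustrated with a short sketch. The small dense arrays are assumed sample data, and sklearn.preprocessing.normalize stands in for the normalization that a tf-idf transformer would apply:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.preprocessing import normalize

X = np.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
Y = np.array([[1.0, 0.0], [2.0, 1.0]])

# Cosine similarity computed directly on the raw vectors.
S = cosine_similarity(X, Y)

# Scale each row to unit L2 norm, then take plain dot products.
X_unit = normalize(X, norm="l2")
Y_unit = normalize(Y, norm="l2")
S_linear = linear_kernel(X_unit, Y_unit)

print(np.allclose(S, S_linear))
```

When the inputs are already normalized, calling linear_kernel directly avoids normalizing them a second time.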

References:

C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/html/htmledition/the-vector-space-model-for-scoring-1.html

6.8.2. Linear kernel

The function linear_kernel computes the linear kernel, that is, a special case of polynomial_kernel with degree=1 and coef0=0 (homogeneous). If x and y are column vectors, their linear kernel is:

$$k(x, y) = x^\top y$$
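
A minimal sketch of this relationship, on assumed sample data, follows. Note that gamma is set to 1 explicitly in the polynomial_kernel call, because its default gamma is 1 / n_features and would otherwise rescale the result:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel

X = np.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
Y = np.array([[1.0, 0.0], [2.0, 1.0]])

# Linear kernel between the rows of X and the rows of Y.
K = linear_kernel(X, Y)

# The same values as a plain matrix product of row vectors.
K_dot = X @ Y.T

# Polynomial kernel reduced to the linear case.
K_poly = polynomial_kernel(X, Y, degree=1, gamma=1.0, coef0=0)

print(np.allclose(K, K_dot), np.allclose(K, K_poly))
```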

6.8.3. Polynomial kernel

The function polynomial_kernel computes the degree-d polynomial kernel between two vectors. The polynomial kernel represents the similarity between two vectors. Conceptually, the polynomial kernel considers not only the similarity between vectors under the same dimension, but also across dimensions. When used in machine learning algorithms, this makes it possible to account for feature interaction.

The polynomial kernel is defined as:

$$k(x, y) = (\gamma x^\top y + c_0)^d$$

where:

x, y are the input vectors
d is the kernel degree

If $c_0 = 0$ the kernel is said to be homogeneous.
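
As a sketch, the output of polynomial_kernel can be compared with a direct evaluation of (gamma * <x, y> + coef0) ** degree. The data and the parameter values below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

X = np.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
Y = np.array([[1.0, 0.0], [2.0, 1.0]])

# Illustrative parameter values (assumptions, not recommended defaults).
degree, gamma, coef0 = 2, 0.5, 1.0

K = polynomial_kernel(X, Y, degree=degree, gamma=gamma, coef0=coef0)

# Direct evaluation: scaled dot products, shifted, then raised to the degree.
K_manual = (gamma * (X @ Y.T) + coef0) ** degree

print(np.allclose(K, K_manual))
```

Setting coef0=0 in the same call gives the homogeneous variant.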

6.8.4. Sigmoid kernel

The function sigmoid_kernel computes the sigmoid kernel between two vectors. The sigmoid kernel is also known as hyperbolic tangent, or Multilayer Perceptron (because, in the neural network field, it is often used as a neuron activation function). It is defined as:

$$k(x, y) = \tanh(\gamma x^\top y + c_0)$$

where:

x, y are the input vectors
$\gamma$ is known as slope
$c_0$ is known as intercept
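
The roles of the slope and the intercept can be seen by evaluating tanh(gamma * <x, y> + coef0) by hand and comparing with sigmoid_kernel. The data and parameter values in this sketch are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

X = np.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
Y = np.array([[1.0, 0.0], [2.0, 1.0]])

# Illustrative slope (gamma) and intercept (coef0).
gamma, coef0 = 0.1, 0.0

K = sigmoid_kernel(X, Y, gamma=gamma, coef0=coef0)

# Direct evaluation: hyperbolic tangent of the scaled, shifted dot products.
K_manual = np.tanh(gamma * (X @ Y.T) + coef0)

print(np.allclose(K, K_manual))
```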

6.8.5. RBF kernel

The function rbf_kernel computes the radial basis function (RBF) kernel between two vectors. This kernel is defined as:

$$k(x, y) = \exp(-\gamma \|x - y\|^2)$$

where x and y are the input vectors. If $\gamma = \sigma^{-2}$, the kernel is known as the Gaussian kernel of variance $\sigma^2$.
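
A short sketch on assumed sample data: rbf_kernel is compared with exp(-gamma * ||x - y||^2) computed from squared Euclidean distances, with gamma derived from an illustrative sigma as gamma = sigma ** -2:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, rbf_kernel

X = np.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
Y = np.array([[1.0, 0.0], [2.0, 1.0]])

# Illustrative width parameter; gamma follows the convention used above.
sigma = 2.0
gamma = sigma ** -2

K = rbf_kernel(X, Y, gamma=gamma)

# Direct evaluation from squared Euclidean distances.
sq_dists = euclidean_distances(X, Y, squared=True)
K_manual = np.exp(-gamma * sq_dists)

print(np.allclose(K, K_manual))
```

If gamma is left at its default of None, rbf_kernel uses 1 / n_features.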

6.8.6. Laplacian kernel

The function laplacian_kernel is a variant on the radial basis function kernel defined as:

$$k(x, y) = \exp(-\gamma \|x - y\|_1)$$

where x and y are the input vectors and $\|x - y\|_1$ is the Manhattan distance between the input vectors.

It has proven useful in ML applied to noiseless data. See e.g. Machine learning for quantum mechanics in a nutshell.
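
The only difference from the RBF kernel is the distance inside the exponential. As a sketch on assumed sample data, laplacian_kernel can be compared with exp(-gamma * d), where d is the Manhattan distance:

```python
import numpy as np
from sklearn.metrics.pairwise import laplacian_kernel, manhattan_distances

X = np.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
Y = np.array([[1.0, 0.0], [2.0, 1.0]])

gamma = 0.5  # illustrative value

K = laplacian_kernel(X, Y, gamma=gamma)

# Direct evaluation from Manhattan (L1) distances.
l1_dists = manhattan_distances(X, Y)
K_manual = np.exp(-gamma * l1_dists)

print(np.allclose(K, K_manual))
```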

6.8.7. Chi-squared kernel

The chi-squared kernel is a very popular choice for training non-linear SVMs in computer vision applications. It can be computed using chi2_kernel and then passed to an sklearn.svm.SVC with kernel="precomputed":

>>> from sklearn.svm import SVC
>>> from sklearn.metrics.pairwise import chi2_kernel
>>> X = [[0, 1], [1, 0], [.2, .8], [.7, .3]]
>>> y = [0, 1, 0, 1]
>>> K = chi2_kernel(X, gamma=.5)
>>> K
array([[1.        , 0.36787944, 0.89483932, 0.58364548],
       [0.36787944, 1.        , 0.51341712, 0.83822343],
       [0.89483932, 0.51341712, 1.        , 0.7768366 ],
       [0.58364548, 0.83822343, 0.7768366 , 1.        ]])
 
>>> svm = SVC(kernel='precomputed').fit(K, y)
>>> svm.predict(K)
array([0, 1, 0, 1])

It can also be directly used as the kernel argument:

>>> svm = SVC(kernel=chi2_kernel).fit(X, y)
>>> svm.predict(X)
array([0, 1, 0, 1])

The chi squared kernel is given by

$$k(x, y) = \exp\left(-\gamma \sum_i \frac{(x[i] - y[i])^2}{x[i] + y[i]}\right)$$

The data is assumed to be non-negative, and is often normalized to have an L1-norm of one. The normalization is rationalized with the connection to the chi squared distance, which is a distance between discrete probability distributions.

The chi squared kernel is most commonly used on histograms (bags) of visual words.
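
The kernel values can be reproduced by hand, as in the sketch below. It reuses the four L1-normalized samples from the example above and evaluates exp(-gamma * sum((x - y)^2 / (x + y))) with NumPy broadcasting. Skipping the terms where x[i] + y[i] is zero is an assumption made here to avoid a division by zero:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

# Non-negative samples whose rows sum to one, like normalized histograms.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.8], [0.7, 0.3]])
gamma = 0.5

K = chi2_kernel(X, gamma=gamma)

# Pairwise differences and sums, indexed as [i, j, feature].
diff = X[:, None, :] - X[None, :, :]
total = X[:, None, :] + X[None, :, :]

# Per-feature terms; entries with a zero denominator contribute nothing.
terms = np.divide(diff ** 2, total, out=np.zeros_like(total), where=total != 0)
K_manual = np.exp(-gamma * terms.sum(axis=-1))

print(np.allclose(K, K_manual))
```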

References:

Zhang, J., Marszalek, M., Lazebnik, S. and Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision. https://research.microsoft.com/en-us/um/people/manik/projects/trade-off/papers/ZhangIJCV06.pdf