1.6. Nearest Neighbors
sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).
Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.
The classes in sklearn.neighbors can handle either NumPy arrays or scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.
There are many learning routines which rely on nearest neighbors at their core. One example is kernel density estimation, discussed in the density estimation section.
1.6.1. Unsupervised Nearest Neighbors
NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword 'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data. For a discussion of the strengths and weaknesses of each option, see Nearest Neighbor Algorithms.
Warning
Regarding the Nearest Neighbors algorithms, if two neighbors $k+1$ and $k$ have identical distances but different labels, the result will depend on the ordering of the training data.
1.6.1.1. Finding the Nearest Neighbors
For the simple task of finding the nearest neighbors between two sets of data, the unsupervised algorithms within sklearn.neighbors can be used:
>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
>>> distances, indices = nbrs.kneighbors(X)
>>> indices
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]]...)
>>> distances
array([[0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.41421356],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.41421356]])
Because the query set matches the training set, the nearest neighbor of each point is the point itself, at a distance of zero.
It is also possible to efficiently produce a sparse graph showing the connections between neighboring points:
>>> nbrs.kneighbors_graph(X).toarray()
array([[1., 1., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 0., 1., 1.]])
The dataset is structured such that points nearby in index order are nearby in parameter space, leading to an approximately block-diagonal matrix of K-nearest neighbors. Such a sparse graph is useful in a variety of circumstances which make use of spatial relationships between points for unsupervised learning: in particular, see sklearn.manifold.Isomap, sklearn.manifold.LocallyLinearEmbedding, and sklearn.cluster.SpectralClustering.
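The radius-based variant mentioned in the introduction can be sketched on the same toy data. This is a minimal illustration only; the radius of 1.5 is an arbitrary value chosen for this example, not a recommended setting.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
# radius=1.5 is an assumption made for this sketch.
nbrs = NearestNeighbors(n_neighbors=2, radius=1.5).fit(X)

# Distance-valued k-neighbors graph: stored entries hold distances
# rather than ones.
dist_graph = nbrs.kneighbors_graph(X, mode="distance")

# Radius-based graph: each row links a point to every training point
# within the radius, so the neighbor count can differ from row to row.
radius_graph = nbrs.radius_neighbors_graph(X, mode="connectivity")
neighbors_per_point = radius_graph.toarray().sum(axis=1)
print(neighbors_per_point)
```

Here the two middle points of each cluster have three neighbors within the radius (counting themselves), while the outer points have two.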
1.6.1.2. KDTree and BallTree Classes
Alternatively, one can use the KDTree or BallTree classes directly to find nearest neighbors. This is the functionality wrapped by the NearestNeighbors class used above. The Ball Tree and KD Tree have the same interface; we’ll show an example of using the KD Tree here:
>>> from sklearn.neighbors import KDTree
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> kdt = KDTree(X, leaf_size=30, metric='euclidean')
>>> kdt.query(X, k=2, return_distance=False)
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]]...)
Refer to the KDTree and BallTree class documentation for more information on the options available for nearest neighbors searches, including specification of query strategies, distance metrics, etc. For a list of available metrics, see the documentation of the DistanceMetric class.
1.6.2. Nearest Neighbors Classification
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the $k$ nearest neighbors of each query point, where $k$ is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius $r$ of each training point, where $r$ is a floating-point value specified by the user.
The $k$-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal choice of the value $k$ is highly data-dependent: in general a larger $k$ suppresses the effects of noise, but makes the classification boundaries less distinct.
In cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius $r$, such that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called “curse of dimensionality”.
The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The default value, weights = 'uniform', assigns uniform weights to each neighbor. weights = 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.
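The effect of the weights keyword can be seen on a tiny one-dimensional dataset. The following is a minimal sketch; the data, the query point and the helper gaussian_weights are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two class-0 points far from the query, one class-1 point very close to it.
X = np.array([[0.0], [1.0], [3.0]])
y = np.array([0, 0, 1])
query = np.array([[2.9]])


def gaussian_weights(distances):
    # Hypothetical user-defined weighting: receives an array of distances
    # and returns an array of weights with the same shape.
    return np.exp(-distances ** 2)


uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)
custom = KNeighborsClassifier(n_neighbors=3, weights=gaussian_weights).fit(X, y)

pred_uniform = uniform.predict(query)[0]    # plain majority vote
pred_distance = distance.predict(query)[0]  # inverse-distance weighted vote
pred_custom = custom.predict(query)[0]      # vote weighted by the callable
print(pred_uniform, pred_distance, pred_custom)
```

With uniform weights the two distant class-0 points outvote the single nearby class-1 point; with either distance-based weighting the nearby point dominates.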
Examples:
- Nearest Neighbors Classification: an example of classification using nearest neighbors.
1.6.3. Nearest Neighbors Regression
Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
scikit-learn implements two different neighbors regressors: KNeighborsRegressor implements learning based on the $k$ nearest neighbors of each query point, where $k$ is an integer value specified by the user. RadiusNeighborsRegressor implements learning based on the neighbors within a fixed radius $r$ of the query point, where $r$ is a floating-point value specified by the user.
The basic nearest neighbors regression uses uniform weights: that is, each point in the local neighborhood contributes uniformly to the prediction for a query point. Under some circumstances, it can be advantageous to weight points such that nearby points contribute more to the regression than faraway points. This can be accomplished through the weights keyword. The default value, weights = 'uniform', assigns equal weights to all points. weights = 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied, which will be used to compute the weights.
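As a small worked sketch of the two built-in weighting schemes for regression, consider a query at 1.25 whose two nearest neighbors lie at distances 0.25 and 0.75. The toy data below is an assumption made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
query = np.array([[1.25]])

uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)
distance = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)

# Uniform: plain mean of the two nearest labels, (1 + 4) / 2.
pred_uniform = uniform.predict(query)[0]
# Distance: weights 1/0.25 and 1/0.75, so (4*1 + (4/3)*4) / (4 + 4/3).
pred_distance = distance.predict(query)[0]
print(pred_uniform, pred_distance)
```

The distance-weighted prediction is pulled toward the label of the closer neighbor.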
The use of multi-output nearest neighbors for regression is demonstrated in Face completion with a multi-output estimators. In this example, the inputs X are the pixels of the upper half of faces and the outputs Y are the pixels of the lower half of those faces.
Examples:
- Nearest Neighbors regression: an example of regression using nearest neighbors.
- Face completion with a multi-output estimators: an example of multi-output regression using nearest neighbors.
1.6.4. Nearest Neighbor Algorithms
1.6.4.1. Brute Force
Fast computation of nearest neighbors is an active area of research in machine learning. The most naive neighbor search implementation involves the brute-force computation of distances between all pairs of points in the dataset: for $N$ samples in $D$ dimensions, this approach scales as $O[D N^2]$. Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples $N$ grows, the brute-force approach quickly becomes infeasible. In the classes within sklearn.neighbors, brute-force neighbors searches are specified using the keyword algorithm = 'brute', and are computed using the routines available in sklearn.metrics.pairwise.
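The brute-force idea can be sketched directly in NumPy. This is an illustrative re-implementation under simple assumptions (random data, Euclidean distance), not the routine scikit-learn uses internally.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # N = 200 samples, D = 5 dimensions
k = 3

# All pairwise squared distances: an (N, N) array, costing on the order
# of D * N^2 operations.
diff = X[:, None, :] - X[None, :, :]
sq_dists = np.einsum("ijk,ijk->ij", diff, diff)

# For each row, the indices of the k smallest distances. Each point is
# its own nearest neighbor at distance zero.
manual_indices = np.argsort(sq_dists, axis=1)[:, :k]

nbrs = NearestNeighbors(n_neighbors=k, algorithm="brute").fit(X)
sk_indices = nbrs.kneighbors(X, return_distance=False)
print(np.array_equal(manual_indices, sk_indices))
```

The intermediate array of differences grows with both N squared and D, which is one way to see why this approach stops being practical for large datasets.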
1.6.4.2. K-D Tree
To address the computational inefficiencies of the brute-force approach, a variety of tree-based data structures have been invented. In general, these structures attempt to reduce the required number of distance calculations by efficiently encoding aggregate distance information for the sample. The basic idea is that if point $A$ is very distant from point $B$, and point $B$ is very close to point $C$, then we know that points $A$ and $C$ are very distant, without having to explicitly calculate their distance. In this way, the computational cost of a nearest neighbors search can be reduced to $O[D N \log(N)]$ or better. This is a significant improvement over brute-force for large $N$.
An early approach to taking advantage of this aggregate information was the KD tree data structure (short for K-dimensional tree), which generalizes two-dimensional Quad-trees and 3-dimensional Oct-trees to an arbitrary number of dimensions. The KD tree is a binary tree structure which recursively partitions the parameter space along the data axes, dividing it into nested orthotropic regions into which data points are filed. The construction of a KD tree is very fast: because partitioning is performed only along the data axes, no $D$-dimensional distances need to be computed. Once constructed, the nearest neighbor of a query point can be determined with only $O[\log(N)]$ distance computations. Though the KD tree approach is very fast for low-dimensional ($D < 20$) neighbors searches, it becomes inefficient as $D$ grows very large: this is one manifestation of the so-called “curse of dimensionality”. In scikit-learn, KD tree neighbors searches are specified using the keyword algorithm = 'kd_tree', and are computed using the class KDTree.
References:
- “Multidimensional binary search trees used for associative searching”, Bentley, J.L., Communications of the ACM (1975)
1.6.4.3. Ball Tree
To address the inefficiencies of KD Trees in higher dimensions, the ball tree data structure was developed. Where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres. This makes tree construction more costly than that of the KD tree, but results in a data structure which can be very efficient on highly structured data, even in very high dimensions.
A ball tree recursively divides the data into nodes defined by a centroid $C$ and radius $r$, such that each point in the node lies within the hyper-sphere defined by $r$ and $C$. The number of candidate points for a neighbor search is reduced through use of the triangle inequality:

$$|x + y| \leq |x| + |y|$$

With this setup, a single distance calculation between a test point and the centroid is sufficient to determine a lower and upper bound on the distance to all points within the node. Because of the spherical geometry of the ball tree nodes, it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure of the training data. In scikit-learn, ball-tree-based neighbors searches are specified using the keyword algorithm = 'ball_tree', and are computed using the class sklearn.neighbors.BallTree. Alternatively, the user can work with the BallTree class directly.
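The bound obtained from a single distance computation can be checked numerically. The sketch below treats one set of random points as a single node; the names and the query point are assumptions for illustration, and it does not reproduce how BallTree builds or traverses its nodes.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))   # points belonging to one node
centroid = points.mean(axis=0)      # node centroid C
radius = np.linalg.norm(points - centroid, axis=1).max()  # node radius r

query = np.array([4.0, 0.0, 0.0])
d_qc = np.linalg.norm(query - centroid)  # the single distance computation

# Triangle inequality: for every point x in the node,
#   max(0, |q - C| - r) <= |q - x| <= |q - C| + r
lower = max(0.0, d_qc - radius)
upper = d_qc + radius

true_dists = np.linalg.norm(points - query, axis=1)
print(lower, true_dists.min(), true_dists.max(), upper)
```

If the lower bound already exceeds the distance to the best candidate found so far, every point in the node can be skipped without computing its distance.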
References:
- “Five balltree construction algorithms”, Omohundro, S.M., International Computer Science Institute Technical Report (1989)
1.6.4.4. Choice of Nearest Neighbors Algorithm
The optimal algorithm for a given dataset is a complicated choice, and depends on a number of factors:
- number of samples $N$ (i.e. n_samples) and dimensionality $D$ (i.e. n_features).
  - Brute force query time grows as $O[D N]$
  - Ball tree query time grows as approximately $O[D \log(N)]$
  - KD tree query time changes with $D$ in a way that is difficult to precisely characterise. For small $D$ (less than 20 or so) the cost is approximately $O[D \log(N)]$, and the KD tree query can be very efficient. For larger $D$, the cost increases to nearly $O[D N]$, and the overhead due to the tree structure can lead to queries which are slower than brute force.

  For small data sets ($N$ less than 30 or so), $\log(N)$ is comparable to $N$, and brute force algorithms can be more efficient than a tree-based approach. Both KDTree and BallTree address this through providing a leaf size parameter: this controls the number of samples at which a query switches to brute-force. This allows both algorithms to approach the efficiency of a brute-force computation for small $N$.
- data structure: intrinsic dimensionality of the data and/or sparsity of the data. Intrinsic dimensionality refers to the dimension $d \leq D$ of a manifold on which the data lies, which can be linearly or non-linearly embedded in the parameter space. Sparsity refers to the degree to which the data fills the parameter space (this is to be distinguished from the concept as used in “sparse” matrices. The data matrix may have no zero entries, but the structure can still be “sparse” in this sense).
  - Brute force query time is unchanged by data structure.
  - Ball tree and KD tree query times can be greatly influenced by data structure. In general, sparser data with a smaller intrinsic dimensionality leads to faster query times. Because the KD tree internal representation is aligned with the parameter axes, it will not generally show as much improvement as ball tree for arbitrarily structured data.

  Datasets used in machine learning tend to be very structured, and are very well-suited for tree-based queries.
- number of neighbors $k$ requested for a query point.
  - Brute force query time is largely unaffected by the value of $k$
  - Ball tree and KD tree query time will become slower as $k$ increases. This is due to two effects: first, a larger $k$ leads to the necessity to search a larger portion of the parameter space. Second, using $k > 1$ requires internal queueing of results as the tree is traversed.

  As $k$ becomes large compared to $N$, the ability to prune branches in a tree-based query is reduced. In this situation, brute force queries can be more efficient.
- number of query points. Both the ball tree and the KD Tree require a construction phase. The cost of this construction becomes negligible when amortized over many queries. If only a small number of queries will be performed, however, the construction can make up a significant fraction of the total cost. If very few query points will be required, brute force is generally a better choice than a tree-based method.
Currently, algorithm = 'auto' selects 'brute' if $k \geq N/2$, the input data is sparse, or effective_metric_ isn’t in the VALID_METRICS list for either 'kd_tree' or 'ball_tree'. Otherwise, it selects the first out of 'kd_tree' and 'ball_tree' that has effective_metric_ in its VALID_METRICS list. This choice is based on the assumption that the number of query points is at least the same order as the number of training points, and that leaf_size is close to its default value of 30.
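Whichever option is selected, the choice affects cost rather than results. As a hedged sketch on random data (where exact distance ties are not expected), the following checks that all settings of algorithm return the same neighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
queries = rng.normal(size=(20, 3))

results = {}
for algorithm in ["brute", "kd_tree", "ball_tree", "auto"]:
    nbrs = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    results[algorithm] = nbrs.kneighbors(queries)  # (distances, indices)

ref_dist, ref_idx = results["brute"]
same = all(
    np.array_equal(idx, ref_idx) and np.allclose(dist, ref_dist)
    for dist, idx in results.values()
)
print(same)
```

When tied distances are present, the ordering of tied neighbors may differ between algorithms, so this equality should not be assumed for arbitrary data.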
1.6.4.5. Effect of leaf_size
As noted above, for small sample sizes a brute force search can be more efficient than a tree-based query. This fact is accounted for in the ball tree and KD tree by internally switching to brute force searches within leaf nodes. The level of this switch can be specified with the parameter leaf_size. This parameter choice has many effects:
- construction time: A larger leaf_size leads to a faster tree construction time, because fewer nodes need to be created.
- query time: Either a large or a small leaf_size can lead to suboptimal query cost. For leaf_size approaching 1, the overhead involved in traversing nodes can significantly slow query times. For leaf_size approaching the size of the training set, queries become essentially brute force. A good compromise between these is leaf_size = 30, the default value of the parameter.
- memory: As leaf_size increases, the memory required to store a tree structure decreases. This is especially important in the case of ball tree, which stores a $D$-dimensional centroid for each node. The required storage space for BallTree is approximately 1 / leaf_size times the size of the training set.

leaf_size is not referenced for brute force queries.
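Since leaf_size only changes where the tree hands over to a brute-force scan, it is expected to affect speed and memory but not the neighbors returned. A minimal sketch on random data, with leaf sizes picked arbitrarily for illustration:

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
queries = X[:10]

reference = KDTree(X, leaf_size=30).query(queries, k=4, return_distance=False)

all_equal = True
for leaf_size in [2, 30, 300]:       # tiny leaves, the default, one big leaf
    for Tree in (KDTree, BallTree):
        indices = Tree(X, leaf_size=leaf_size).query(
            queries, k=4, return_distance=False
        )
        all_equal = all_equal and np.array_equal(indices, reference)
print(all_equal)
```

Timing the same loop on a larger dataset is one way to look for a leaf_size that suits a particular problem.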
1.6.5. Nearest Centroid Classifier
The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. In effect, this makes it similar to the label updating phase of the sklearn.cluster.KMeans algorithm. It also has no parameters to choose, making it a good baseline classifier. It does, however, suffer on non-convex classes, as well as when classes have drastically different variances, as equal variance in all dimensions is assumed. See Linear Discriminant Analysis (sklearn.discriminant_analysis.LinearDiscriminantAnalysis) and Quadratic Discriminant Analysis (sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis) for more complex methods that do not make this assumption. Usage of the default NearestCentroid is simple:
>>> from sklearn.neighbors import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf.fit(X, y)
NearestCentroid()
>>> print(clf.predict([[-0.8, -1]]))
[1]
1.6.5.1. Nearest Shrunken Centroid
The NearestCentroid classifier has a shrink_threshold parameter, which implements the nearest shrunken centroid classifier. In effect, the value of each feature for each centroid is divided by the within-class variance of that feature. The feature values are then reduced by shrink_threshold. Most notably, if a particular feature value crosses zero, it is set to zero. In effect, this removes the feature from affecting the classification. This is useful, for example, for removing noisy features.
In the example below, using a small shrink threshold increases the accuracy of the model from 0.81 to 0.82.
Examples:
- Nearest Centroid Classification: an example of classification using nearest centroid with different shrink thresholds.
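For a quick look at the parameter itself, the sketch below fits the classifier with and without shrinkage on the iris data. The threshold of 0.2 is an arbitrary value for illustration, and this snippet is separate from the linked example, so it is not expected to reproduce the accuracies quoted above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestCentroid

X, y = load_iris(return_X_y=True)

plain = NearestCentroid().fit(X, y)
# shrink_threshold=0.2 is an assumption made for this sketch.
shrunk = NearestCentroid(shrink_threshold=0.2).fit(X, y)

# Both models store one centroid per class; shrinkage changes their values.
print(plain.centroids_.shape, shrunk.centroids_.shape)
print(plain.score(X, y), shrunk.score(X, y))
```

In practice the threshold would be chosen by cross-validation rather than fixed by hand.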
1.6.6. Nearest Neighbors Transformer
Many scikit-learn estimators rely on nearest neighbors: several classifiers and regressors such as KNeighborsClassifier and KNeighborsRegressor, but also some clustering methods such as DBSCAN and SpectralClustering, and some manifold embeddings such as TSNE and Isomap.
All these estimators can compute the nearest neighbors internally, but most of them also accept a precomputed nearest neighbors sparse graph, as given by kneighbors_graph and radius_neighbors_graph. With mode='connectivity', these functions return a binary adjacency sparse graph as required, for instance, in SpectralClustering. Whereas with mode='distance', they return a distance sparse graph as required, for instance, in DBSCAN. To include these functions in a scikit-learn pipeline, one can also use the corresponding classes KNeighborsTransformer and RadiusNeighborsTransformer. The benefits of this sparse graph API are multiple.
First, the precomputed graph can be re-used multiple times, for instance while varying a parameter of the estimator. This can be done manually by the user, or using the caching properties of the scikit-learn pipeline:
>>> from sklearn.manifold import Isomap
>>> from sklearn.neighbors import KNeighborsTransformer
>>> from sklearn.pipeline import make_pipeline
>>> estimator = make_pipeline(
...     KNeighborsTransformer(n_neighbors=5, mode='distance'),
...     Isomap(metric='precomputed'),
...     memory='/path/to/cache')
Second, precomputing the graph can give finer control on the nearest neighbors estimation, for instance enabling multiprocessing through the parameter n_jobs, which might not be available in all estimators.
Finally, the precomputation can be performed by custom estimators to use different implementations, such as approximate nearest neighbors methods, or implementation with special data types. The precomputed neighbors sparse graph needs to be formatted as in radius_neighbors_graph output:
- a CSR matrix (although COO, CSC or LIL will be accepted).
- only explicitly store nearest neighborhoods of each sample with respect to the training data. This should include those at 0 distance from a query point, including the matrix diagonal when computing the nearest neighborhoods between the training data and itself.
- each row’s data should store the distance in increasing order (optional. Unsorted data will be stable-sorted, adding a computational overhead).
- all values in data should be non-negative.
- there should be no duplicate indices in any row (see https://github.com/scipy/scipy/issues/5807).
- if the algorithm being passed the precomputed matrix uses k nearest neighbors (as opposed to radius neighborhood), at least k neighbors must be stored in each row (or k+1, as explained in the following note).
Note
When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of n_neighbors is ambiguous since it can either include each training point as its own neighbor, or exclude them. Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and testing, while excluding them leads to a difference between fit(X).transform(X) and fit_transform(X), which is against the scikit-learn API. In KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count of n_neighbors. However, for compatibility reasons with other estimators which use the other definition, one extra neighbor will be computed when mode == 'distance'. To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.
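The extra neighbor described in the note can be observed directly. The sketch below, on random data generated only for this illustration, chains KNeighborsTransformer with a classifier that consumes the precomputed graph and compares it with a classifier that searches for neighbors itself.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 3))

k = 5
direct = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

# The transformer emits a sparse distance graph; the classifier reads it
# through metric='precomputed'.
pipe = make_pipeline(
    KNeighborsTransformer(n_neighbors=k, mode="distance"),
    KNeighborsClassifier(n_neighbors=k, metric="precomputed"),
).fit(X_train, y_train)

graph = pipe[0].transform(X_test)
stored_per_row = np.diff(graph.indptr)  # number of stored neighbors per row
print(stored_per_row[:5])
print(np.array_equal(direct.predict(X_test), pipe.predict(X_test)))
```

Each row of the graph stores k + 1 entries in 'distance' mode, and the downstream classifier keeps only the k it needs.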
Examples:
- Approximate nearest neighbors in TSNE: an example of pipelining KNeighborsTransformer and TSNE. Also proposes two custom nearest neighbors estimators based on external packages.
- Caching nearest neighbors: an example of pipelining KNeighborsTransformer and KNeighborsClassifier to enable caching of the neighbors graph during a hyper-parameter grid-search.
1.6.7. Neighborhood Components Analysis
Neighborhood Components Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning algorithm which aims to improve the accuracy of nearest neighbors classification compared to the standard Euclidean distance. The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It can also learn a low-dimensional linear projection of data that can be used for data visualization and fast classification.
In the above illustrating figure, we consider some points from a randomly generated dataset. We focus on the stochastic KNN classification of point no. 3. The thickness of a link between sample 3 and another point is proportional to their distance, and can be seen as the relative weight (or probability) that a stochastic nearest neighbor prediction rule would assign to this point. In the original space, sample 3 has many stochastic neighbors from various classes, so the right class is not very likely. However, in the projected space learned by NCA, the only stochastic neighbors with non-negligible weight are from the same class as sample 3, guaranteeing that the latter will be well classified. See the mathematical formulation for more details.
1.6.7.1. Classification
Combined with a nearest neighbors classifier (KNeighborsClassifier), NCA is attractive for classification because it can naturally handle multi-class problems without any increase in the model size, and does not introduce additional parameters that require fine-tuning by the user.
NCA classification has been shown to work well in practice for data sets of varying size and difficulty. In contrast to related methods such as Linear Discriminant Analysis, NCA does not make any assumptions about the class distributions. The nearest neighbor classification can naturally produce highly irregular decision boundaries.
To use this model for classification, one needs to combine a NeighborhoodComponentsAnalysis instance that learns the optimal transformation with a KNeighborsClassifier instance that performs the classification in the projected space. Here is an example using the two classes:
>>> from sklearn.neighbors import (NeighborhoodComponentsAnalysis,
... KNeighborsClassifier)
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
... stratify=y, test_size=0.7, random_state=42)
>>> nca = NeighborhoodComponentsAnalysis(random_state=42)
>>> knn = KNeighborsClassifier(n_neighbors=3)
>>> nca_pipe = Pipeline([('nca', nca), ('knn', knn)])
>>> nca_pipe.fit(X_train, y_train)
Pipeline(...)
>>> print(nca_pipe.score(X_test, y_test))
0.96190476...
The plot shows decision boundaries for Nearest Neighbor Classification and Neighborhood Components Analysis classification on the iris dataset, when training and scoring on only two features, for visualisation purposes.
1.6.7.2. Dimensionality reduction
NCA can be used to perform supervised dimensionality reduction. The input data are projected onto a linear subspace consisting of the directions which minimize the NCA objective. The desired dimensionality can be set using the parameter n_components. For instance, the following figure shows a comparison of dimensionality reduction with Principal Component Analysis (sklearn.decomposition.PCA), Linear Discriminant Analysis (sklearn.discriminant_analysis.LinearDiscriminantAnalysis) and Neighborhood Component Analysis (NeighborhoodComponentsAnalysis) on the Digits dataset, a dataset with size $n_{samples} = 1797$ and $n_{features} = 64$. The data set is split into a training and a test set of equal size, then standardized. For evaluation the 3-nearest neighbor classification accuracy is computed on the 2-dimensional projected points found by each method. Each data sample belongs to one of 10 classes.
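As a smaller-scale sketch of the same idea, the snippet below projects the four iris features onto two learned components. The iris data is used here instead of Digits only to keep the example quick to run.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import NeighborhoodComponentsAnalysis

X, y = load_iris(return_X_y=True)

# n_components sets the dimensionality of the learned projection.
nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
X_embedded = nca.fit_transform(X, y)

print(X_embedded.shape)       # (150, 2)
print(nca.components_.shape)  # (2, 4)
```

The two-dimensional output can then be plotted or passed to a nearest neighbors classifier.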
Examples:
- Comparing Nearest Neighbors with and without Neighborhood Components Analysis
- Dimensionality Reduction with Neighborhood Components Analysis
- Manifold learning on handwritten digits: Locally Linear Embedding, Isomap…
1.6.7.3. Mathematical formulation
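The goal of NCA is to learn an optimal linear transformation matrix of size (n_components, n_features), which maximises the sum over all samples $i$ of the probability $p_i$ that $i$ is correctly classified, i.e.:

$$\underset{L}{\arg\max} \sum_{i=0}^{N-1} p_i$$

with $N$ = n_samples and $p_i$ the probability of sample $i$ being correctly classified according to a stochastic nearest neighbors rule in the learned embedded space:

$$p_i = \sum_{j \in C_i} p_{ij}$$

where $C_i$ is the set of points in the same class as sample $i$, and $p_{ij}$ is the softmax over Euclidean distances in the embedded space:

$$p_{ij} = \frac{\exp(-\|L x_i - L x_j\|^2)}{\sum_{k \neq i} \exp(-\|L x_i - L x_k\|^2)}, \quad p_{ii} = 0$$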
1.6.7.3.1. Mahalanobis distance
NCA can be seen as learning a (squared) Mahalanobis distance metric:

$$\|L(x_i - x_j)\|^2 = (x_i - x_j)^T M (x_i - x_j),$$

where $M = L^T L$ is a symmetric positive semi-definite matrix of size (n_features, n_features).
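The quantities in this formulation can be evaluated with a few lines of NumPy. The sketch below uses a random, untrained matrix L on made-up data, so it illustrates the objective (a softmax over negative squared embedded distances, summed over same-class neighbors) rather than the optimisation performed by NeighborhoodComponentsAnalysis.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))      # 6 samples, 3 features
y = np.array([0, 0, 0, 1, 1, 1])
L = rng.normal(size=(2, 3))      # (n_components, n_features), not optimised

Z = X @ L.T                      # embedded samples L x_i
sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)

# Softmax over negative squared distances, with p_ii forced to 0.
logits = -sq_dists
np.fill_diagonal(logits, -np.inf)
p_ij = np.exp(logits - logits.max(axis=1, keepdims=True))
p_ij /= p_ij.sum(axis=1, keepdims=True)

# p_i sums p_ij over the points j in the same class as i.
same_class = y[:, None] == y[None, :]
p_i = (p_ij * same_class).sum(axis=1)
objective = p_i.sum()            # the quantity NCA maximises over L

# Mahalanobis view: ||L (x_i - x_j)||^2 == (x_i - x_j)^T M (x_i - x_j)
M = L.T @ L
d = X[0] - X[1]
print(objective, d @ M @ d, sq_dists[0, 1])
```

The last two printed values agree, which is the identity behind the Mahalanobis interpretation.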
1.6.7.4. Implementation
This implementation follows what is explained in the original paper [1]. For the optimisation method, it currently uses scipy’s L-BFGS-B with a full gradient computation at each iteration, to avoid tuning the learning rate and to provide stable learning.
See the examples below and the docstring of NeighborhoodComponentsAnalysis.fit for further information.
1.6.7.5. Complexity
1.6.7.5.1. Training
NCA stores a matrix of pairwise distances, taking n_samples ** 2 memory. Time complexity depends on the number of iterations done by the optimisation algorithm. However, one can set the maximum number of iterations with the argument max_iter. For each iteration, time complexity is O(n_components x n_samples x min(n_samples, n_features)).
1.6.7.5.2. Transform
Here the transform operation returns $L X^T$, therefore its time complexity equals n_components * n_features * n_samples_test. There is no added space complexity in the operation.
References:
- [1] “Neighbourhood Components Analysis”, J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Advances in Neural Information Processing Systems, Vol. 17, May 2005, pp. 513-520.