1.14. Semi-supervised learning
Semi-supervised learning is a situation in which some of the samples in your training data are not labeled. The semi-supervised estimators in sklearn.semi_supervised are able to make use of this additional unlabeled data to better capture the shape of the underlying data distribution and generalize better to new samples. These algorithms can perform well when we have a very small amount of labeled points and a large amount of unlabeled points.
Unlabeled entries in y
It is important to assign an identifier to unlabeled points along with the labeled data when training the model with the fit method. The identifier that this implementation uses is the integer value -1.
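As a minimal sketch of this convention, the snippet below hides part of a labeled dataset by overwriting labels with -1 and then fits a LabelPropagation model. The dataset (iris), the random seed, and the 70 percent masking ratio are arbitrary choices made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# Pretend that roughly 70% of the samples are unlabeled by
# overwriting their labels with -1 (the unlabeled marker).
rng = np.random.RandomState(42)
unlabeled_mask = rng.rand(len(y)) < 0.7
y_train = np.copy(y)
y_train[unlabeled_mask] = -1

# Labeled and unlabeled samples are passed to fit together.
model = LabelPropagation()
model.fit(X, y_train)

# transduction_ holds the label inferred for every training sample,
# including those that were marked with -1.
inferred = model.transduction_
```

After fitting, `inferred[unlabeled_mask]` can be compared with the held-back `y[unlabeled_mask]` to get a rough idea of how well the labels were propagated.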
1.14.1. Label Propagation
Label propagation denotes a few variations of semi-supervised graph inference algorithms.
A few features available in this model:
- Used for classification tasks
- Kernel methods to project data into alternate dimensional spaces
scikit-learn provides two label propagation models: LabelPropagation and LabelSpreading. Both work by constructing a similarity graph over all items in the input dataset.
An illustration of label propagation: the structure of unlabeled observations is consistent with the class structure, and thus the class label can be propagated to the unlabeled observations of the training set.
LabelPropagation and LabelSpreading differ in the modifications they make to the similarity matrix of that graph and in the clamping effect on the label distributions. Clamping allows the algorithm to change the weight of the true ground labeled data to some degree. The LabelPropagation algorithm performs hard clamping of input labels, which means $\alpha = 0$. This clamping factor can be relaxed, to say $\alpha = 0.2$, which means that we will always retain 80 percent of our original label distribution, but the algorithm gets to change its confidence of the distribution within 20 percent.
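As a rough illustration of the clamping factor, the sketch below fits LabelSpreading twice with different values of its alpha parameter and compares how confident each model remains about the points that were labeled to begin with. The dataset (make_moons), the number of labeled points, and the knn settings are assumptions made for this example, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Keep only five labels per class; everything else is unlabeled (-1).
y_train = np.full_like(y, -1)
for cls in (0, 1):
    idx = np.flatnonzero(y == cls)[:5]
    y_train[idx] = cls

# alpha is the clamping factor: a small alpha keeps labeled points
# close to their initial labels, a larger alpha lets the neighbors
# influence them more.
soft = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2).fit(X, y_train)
softer = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.8).fit(X, y_train)

# Mean confidence in the winning class, restricted to the labeled points.
labeled = y_train != -1
conf_02 = soft.label_distributions_[labeled].max(axis=1).mean()
conf_08 = softer.label_distributions_[labeled].max(axis=1).mean()
print(f"alpha=0.2: {conf_02:.3f}, alpha=0.8: {conf_08:.3f}")
```

The printed values depend on the data and on the graph, so they are best read as a qualitative comparison between the two settings.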
LabelPropagation uses the raw similarity matrix constructed from the data with no modifications. In contrast, LabelSpreading minimizes a loss function that has regularization properties, and as such is often more robust to noise. The algorithm iterates on a modified version of the original graph and normalizes the edge weights by computing the normalized graph Laplacian matrix. This procedure is also used in Spectral clustering.
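The idea of iterating on a normalized graph can be sketched in plain NumPy. The code below uses one common formulation of label spreading, F <- alpha * S F + (1 - alpha) * Y0 with S = D^{-1/2} W D^{-1/2}; it is an illustrative assumption about the general technique, not scikit-learn's actual implementation. The toy data, gamma, alpha, and the fixed iteration count are all made up for the example.

```python
import numpy as np

# Two small clusters on a line; only the first point of each is labeled.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, -1, -1, 1, -1, -1])

# RBF affinities between all pairs of points, with no self-loops.
gamma = 20.0
sq_dists = (X - X.T) ** 2  # (6, 1) minus (1, 6) broadcasts to (6, 6)
W = np.exp(-gamma * sq_dists)
np.fill_diagonal(W, 0.0)

# Symmetric normalization of the edge weights: S = D^{-1/2} W D^{-1/2}.
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# One-hot encoding of the known labels; unlabeled rows stay at zero.
Y0 = np.zeros((len(y), 2))
labeled = y != -1
Y0[labeled, y[labeled]] = 1.0

# Soft clamping: each step mixes propagated scores with the initial labels.
alpha = 0.2
F = Y0.copy()
for _ in range(100):
    F = alpha * (S @ F) + (1 - alpha) * Y0

predicted = F.argmax(axis=1)
```

Because the two clusters are far apart relative to the kernel width, each unlabeled point ends up with the label of the labeled point in its own cluster.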
Label propagation models have two built-in kernel methods. The choice of kernel affects both the scalability and the performance of the algorithms. The following are available:
- rbf ($\exp(-\gamma \|x - y\|^2)$, $\gamma > 0$). $\gamma$ is specified by keyword gamma.
- knn ($1[x' \in kNN(x)]$). $k$ is specified by keyword n_neighbors.
The RBF kernel will produce a fully connected graph, which is represented in memory by a dense matrix. This matrix may be very large and, combined with the cost of performing a full matrix multiplication for each iteration of the algorithm, can lead to prohibitively long running times. On the other hand, the KNN kernel will produce a much more memory-friendly sparse matrix, which can drastically reduce running times.
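The difference in memory footprint can be seen by building both kinds of graph by hand. The sketch below uses rbf_kernel and kneighbors_graph as stand-ins for the two kernels; the random data, gamma=20 and n_neighbors=7 are assumptions chosen for illustration, and the estimators may construct their graphs differently internally.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.rand(500, 2)

# RBF: every pair of samples gets a weight, so the graph is a dense
# (n_samples, n_samples) array.
W_rbf = rbf_kernel(X, X, gamma=20)

# kNN: each sample is connected only to its 7 nearest neighbors
# (itself included), so the graph is stored as a sparse matrix.
W_knn = kneighbors_graph(X, n_neighbors=7, mode="connectivity", include_self=True)

rbf_entries = W_rbf.size  # n_samples * n_samples stored values
knn_entries = W_knn.nnz   # n_samples * n_neighbors stored values
print(rbf_entries, knn_entries)
```

The dense graph grows quadratically with the number of samples, while the sparse one grows linearly, which is why the knn kernel tends to scale better to large datasets.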