1.2. Linear and Quadratic Discriminant Analysis
Linear Discriminant Analysis (discriminant_analysis.LinearDiscriminantAnalysis) and Quadratic Discriminant Analysis (discriminant_analysis.QuadraticDiscriminantAnalysis) are two classic classifiers, with, as their names suggest, a linear and a quadratic decision surface, respectively.
These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no hyperparameters to tune.
The plot shows decision boundaries for Linear Discriminant Analysis and Quadratic Discriminant Analysis. The bottom row demonstrates that Linear Discriminant Analysis can only learn linear boundaries, while Quadratic Discriminant Analysis can learn quadratic boundaries and is therefore more flexible.
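Both classes follow the standard scikit-learn estimator API. As a minimal sketch (the synthetic dataset below is chosen only for illustration):

# Fit LDA and QDA classifiers on a small synthetic multiclass dataset.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# Both expose predict, predict_proba and score; no hyperparameter tuning is needed.
print(lda.score(X, y), qda.score(X, y))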
Examples:
Linear and Quadratic Discriminant Analysis with covariance ellipsoid: Comparison of LDA and QDA on synthetic data.
1.2.1. Dimensionality reduction using Linear Discriminant Analysis
discriminant_analysis.LinearDiscriminantAnalysis can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes (in a precise sense discussed in the mathematics section below). The dimension of the output is necessarily less than the number of classes, so this is, in general, a rather strong dimensionality reduction, and only makes sense in a multiclass setting.
This is implemented in discriminant_analysis.LinearDiscriminantAnalysis.transform. The desired dimensionality can be set using the n_components constructor parameter. This parameter has no influence on discriminant_analysis.LinearDiscriminantAnalysis.fit or discriminant_analysis.LinearDiscriminantAnalysis.predict.
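As a sketch (using the Iris dataset purely for illustration), the projection is obtained by fitting on the labels and then calling transform:

# Project the 4-dimensional Iris data onto at most n_classes - 1 = 2
# discriminative directions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# n_components only affects transform, not fit or predict.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit(X, y).transform(X)
print(X_reduced.shape)  # (150, 2)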
Examples:
Comparison of LDA and PCA 2D projection of Iris dataset: Comparison of LDA and PCA for dimensionality reduction of the Iris dataset.
1.2.2. Mathematical formulation of the LDA and QDA classifiers
Both LDA and QDA can be derived from simple probabilistic models which model the class conditional distribution of the data \(P(X|y=k)\) for each class \(k\). Predictions can then be obtained by using Bayes’ rule:

\[P(y=k | X) = \frac{P(X | y=k) P(y=k)}{P(X)} = \frac{P(X | y=k) P(y=k)}{\sum_{l} P(X | y=l) \cdot P(y=l)}\]

and we select the class \(k\) which maximizes this conditional probability.
More specifically, for linear and quadratic discriminant analysis, \(P(X|y)\) is modeled as a multivariate Gaussian distribution with density:

\[P(X | y=k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma_k^{-1} (X-\mu_k)\right)\]

where \(d\) is the number of features.
To use this model as a classifier, we just need to estimate from the training data the class priors \(P(y=k)\) (by the proportion of instances of class \(k\)), the class means \(\mu_k\) (by the empirical sample class means) and the covariance matrices (either by the empirical sample class covariance matrices, or by a regularized estimator: see the section on shrinkage below).
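These estimates are available as fitted attributes. A small sketch (the Iris dataset is only illustrative; the covariance estimate is stored for the ‘lsqr’ and ‘eigen’ solvers, and for ‘svd’ only when store_covariance=True):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(solver='lsqr').fit(X, y)

print(lda.priors_)            # estimated class priors P(y=k)
print(lda.means_)             # estimated class means, shape (n_classes, n_features)
print(lda.covariance_.shape)  # shared covariance estimate, shape (n_features, n_features)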
In the case of LDA, the Gaussians for each class are assumed to share the same covariance matrix: \(\Sigma_k = \Sigma\) for all \(k\). This leads to linear decision surfaces, which can be seen by comparing the log-probability ratios \(\log[P(y=k | X) / P(y=l | X)]\):

\[\log\left(\frac{P(y=k|X)}{P(y=l|X)}\right) = 0 \Leftrightarrow (\mu_k-\mu_l)^t \Sigma^{-1} X = \frac{1}{2} \left(\mu_k^t \Sigma^{-1} \mu_k - \mu_l^t \Sigma^{-1} \mu_l\right) - \log\frac{P(y=k)}{P(y=l)}\]
In the case of QDA, there are no assumptions on the covariance matrices \(\Sigma_k\) of the Gaussians, leading to quadratic decision surfaces. See [3] for more details.
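The rule above can be checked numerically: a hand-rolled Gaussian Bayes classifier built from the per-class priors, means and covariances should reproduce the predictions of QuadraticDiscriminantAnalysis (up to numerical details of the covariance estimates). A minimal sketch, with the dataset chosen only for illustration:

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
classes = np.unique(y)

# Per-class estimates: priors, means and sample covariances.
priors = np.array([np.mean(y == k) for k in classes])
means = np.array([X[y == k].mean(axis=0) for k in classes])
covs = [np.cov(X[y == k], rowvar=False) for k in classes]

# log P(X | y=k) + log P(y=k) for each sample and class; Bayes' rule picks
# the class with the largest value (the evidence P(X) cancels in the argmax).
log_joint = np.column_stack([
    multivariate_normal(mean=means[k], cov=covs[k]).logpdf(X) + np.log(priors[k])
    for k in classes
])
manual_pred = classes[np.argmax(log_joint, axis=1)]

qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(np.mean(manual_pred == qda.predict(X)))  # expected to be (close to) 1.0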
Note
Relation with Gaussian Naive Bayes
If in the QDA model one assumes that the covariance matrices are diagonal, then the inputs are assumed to be conditionally independent in each class, and the resulting classifier is equivalent to the Gaussian Naive Bayes classifier naive_bayes.GaussianNB.
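To make this connection concrete, the sketch below keeps only the per-feature variances (a diagonal covariance) in the manual Gaussian model and compares it with naive_bayes.GaussianNB; on well-behaved data the predictions should essentially coincide (GaussianNB adds a tiny variance smoothing term by default, so agreement may not be bit-exact):

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
classes = np.unique(y)

priors = np.array([np.mean(y == k) for k in classes])
means = np.array([X[y == k].mean(axis=0) for k in classes])
# Diagonal covariances: per-feature variances only, i.e. the features are
# treated as conditionally independent within each class.
variances = np.array([X[y == k].var(axis=0) for k in classes])

log_joint = np.column_stack([
    multivariate_normal(mean=means[k], cov=np.diag(variances[k])).logpdf(X)
    + np.log(priors[k])
    for k in classes
])
diag_pred = classes[np.argmax(log_joint, axis=1)]

gnb = GaussianNB().fit(X, y)
print(np.mean(diag_pred == gnb.predict(X)))  # expected to be (close to) 1.0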
1.2.3. Mathematical formulation of LDA dimensionality reduction
To understand the use of LDA in dimensionality reduction, it is useful to start with a geometric reformulation of the LDA classification rule explained above. We write \(K\) for the total number of target classes. Since in LDA we assume that all classes have the same estimated covariance \(\Sigma\), we can rescale the data so that this covariance is the identity:

\[X^* = D^{-1/2} U^t X \quad \text{with} \quad \Sigma = U D U^t\]

Then one can show that to classify a data point after scaling is equivalent to finding the estimated class mean \(\mu^*_k\) which is closest to the data point in the Euclidean distance. But this can be done just as well after projecting on the \(K-1\) dimensional affine subspace \(H_K\) generated by all the \(\mu^*_k\) for all classes. This shows that, implicit in the LDA classifier, there is a dimensionality reduction by linear projection onto a \(K-1\) dimensional space.
We can reduce the dimension even more, to a chosen \(L\), by projecting onto the linear subspace \(H_L\) which maximizes the variance of the \(\mu^*_k\) after projection (in effect, we are doing a form of PCA for the transformed class means \(\mu^*_k\)). This corresponds to the n_components parameter used in the discriminant_analysis.LinearDiscriminantAnalysis.transform method. See [3] for more details.
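A sketch of this geometric picture (assuming balanced classes, so that equal priors do not tip the decision, and using an illustrative synthetic dataset): sphere the data with the pooled within-class covariance, classify by the nearest sphered class mean, and compare with the LDA predictions.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Balanced classes, so nearest-mean classification after sphering should
# match the full LDA rule (the prior terms are identical across classes).
X, y = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
classes = np.unique(y)
means = np.array([X[y == k].mean(axis=0) for k in classes])

# Pooled within-class covariance and its eigendecomposition Sigma = U D U^t.
pooled = sum(np.cov(X[y == k], rowvar=False) for k in classes) / len(classes)
D, U = np.linalg.eigh(pooled)

# Sphere the data and the class means: X* = D^{-1/2} U^t X (row-vector form).
sphere = U / np.sqrt(D)
X_star = X @ sphere
means_star = means @ sphere

# Nearest sphered class mean ...
dists = ((X_star[:, None, :] - means_star[None, :, :]) ** 2).sum(axis=-1)
nearest_mean_pred = classes[np.argmin(dists, axis=1)]

# ... versus the LDA classifier itself.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(np.mean(nearest_mean_pred == lda.predict(X)))  # expected to be (close to) 1.0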
1.2.4. Shrinkage
Shrinkage is a tool to improve estimation of covariance matrices in situations where the number of training samples is small compared to the number of features. In this scenario, the empirical sample covariance is a poor estimator. Shrinkage LDA can be used by setting the shrinkage parameter of the discriminant_analysis.LinearDiscriminantAnalysis class to ‘auto’. This automatically determines the optimal shrinkage parameter in an analytic way following the lemma introduced by Ledoit and Wolf [4]. Note that currently shrinkage only works when setting the solver parameter to ‘lsqr’ or ‘eigen’.
The shrinkage parameter can also be manually set between 0 and 1. In particular, a value of 0 corresponds to no shrinkage (which means the empirical covariance matrix will be used) and a value of 1 corresponds to complete shrinkage (which means that the diagonal matrix of variances will be used as an estimate for the covariance matrix). Setting this parameter to a value between these two extrema will estimate a shrunk version of the covariance matrix.
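A sketch comparing plain and shrunk LDA in a small-sample, relatively high-dimensional setting (the dataset and its size are only illustrative); shrinkage typically helps when the number of samples is not much larger than the number of features:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Few samples relative to the number of features: the empirical covariance
# estimate is noisy, which is where shrinkage is expected to help.
X, y = make_classification(n_samples=80, n_features=40, n_informative=10,
                           random_state=0)

plain = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None)
auto = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')  # Ledoit-Wolf
fixed = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=0.5)    # manual value in [0, 1]

for name, clf in [('no shrinkage', plain), ('auto', auto), ('0.5', fixed)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean())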
1.2.5. Estimation algorithms
The default solver is ‘svd’. It can perform both classification and transform, and it does not rely on the calculation of the covariance matrix. This can be an advantage in situations where the number of features is large. However, the ‘svd’ solver cannot be used with shrinkage.
The ‘lsqr’ solver is an efficient algorithm that only works for classification. It supports shrinkage.
The ‘eigen’ solver is based on the optimization of the between class scatter to within class scatter ratio. It can be used for both classification and transform, and it supports shrinkage. However, the ‘eigen’ solver needs to compute the covariance matrix, so it might not be suitable for situations with a high number of features.
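A quick sketch of selecting a solver (the Iris dataset is only illustrative); all three solvers can classify, while only ‘svd’ and ‘eigen’ also support transform, and only ‘lsqr’ and ‘eigen’ support shrinkage:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# 'svd' (default): classification + transform, avoids the covariance matrix, no shrinkage.
# 'lsqr': classification only, supports shrinkage.
# 'eigen': classification + transform, supports shrinkage, computes the covariance matrix.
for solver in ('svd', 'lsqr', 'eigen'):
    clf = LinearDiscriminantAnalysis(solver=solver).fit(X, y)
    print(solver, clf.score(X, y))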
Examples:
Normal and Shrinkage Linear Discriminant Analysis for classification: Comparison of LDA classifiers with and without shrinkage.
References:

[3] “The Elements of Statistical Learning”, Hastie T., Tibshirani R., Friedman J., Section 4.3, 2008.

[4] Ledoit O., Wolf M., “Honey, I Shrunk the Sample Covariance Matrix”, The Journal of Portfolio Management, 2004.