2.6. Covariance estimation
Many statistical problems require the estimation of a population's covariance matrix, which can be seen as an estimation of a data set's scatter plot shape. Most of the time, such an estimation has to be done on a sample whose properties (size, structure, homogeneity) have a large influence on the estimation's quality. The `sklearn.covariance` package provides tools for accurately estimating a population's covariance matrix under various settings.

We assume that the observations are independent and identically distributed (i.i.d.).
2.6.1. Empirical covariance
The covariance matrix of a data set is known to be well approximated by the classical maximum likelihood estimator (or "empirical covariance"), provided the number of observations is large enough compared to the number of features (the variables describing the observations). More precisely, the Maximum Likelihood Estimator of a sample is an asymptotically unbiased estimator of the corresponding population's covariance matrix.
The empirical covariance matrix of a sample can be computed using the `empirical_covariance` function of the package, or by fitting an `EmpiricalCovariance` object to the data sample with the `EmpiricalCovariance.fit` method. Be careful that results depend on whether the data are centered, so one may want to set the `assume_centered` parameter correctly. More precisely, if `assume_centered=False`, then the test set is supposed to have the same mean vector as the training set. If not, both should be centered by the user, and `assume_centered=True` should be used.
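A minimal sketch of both routes (the synthetic 2-d Gaussian sample below is invented purely for illustration):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, empirical_covariance

# Synthetic sample drawn from a known 2-d Gaussian (illustrative only).
rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.3], [0.3, 1.0]], size=500)

# Function form: returns the maximum likelihood covariance estimate directly.
cov = empirical_covariance(X, assume_centered=False)

# Estimator form: fit an EmpiricalCovariance object to the same data.
est = EmpiricalCovariance(assume_centered=False).fit(X)
print(est.covariance_)  # estimated covariance matrix
print(est.location_)    # estimated mean vector
```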
Examples:
- See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit an `EmpiricalCovariance` object to data.
2.6.2. Shrunk Covariance
2.6.2.1. Basic shrinkage
Despite being an asymptotically unbiased estimator of the covariance matrix, the Maximum Likelihood Estimator is not a good estimator of the eigenvalues of the covariance matrix, so the precision matrix obtained from its inversion is not accurate. Sometimes, it even occurs that the empirical covariance matrix cannot be inverted for numerical reasons. To avoid such an inversion problem, a transformation of the empirical covariance matrix has been introduced: the shrinkage.
In scikit-learn, this transformation (with a user-defined shrinkage coefficient) can be directly applied to a pre-computed covariance with the `shrunk_covariance` function. Also, a shrunk estimator of the covariance can be fitted to data with a `ShrunkCovariance` object and its `ShrunkCovariance.fit` method. Again, results depend on whether the data are centered, so one may want to set the `assume_centered` parameter correctly.
Mathematically, this shrinkage consists in reducing the ratio between the smallest and the largest eigenvalues of the empirical covariance matrix. It can be done by simply shifting every eigenvalue according to a given offset, which is equivalent to finding the l2-penalized Maximum Likelihood Estimator of the covariance matrix. In practice, shrinkage boils down to a simple convex transformation:

$$\Sigma_{\rm shrunk} = (1 - \alpha) \hat{\Sigma} + \alpha \frac{{\rm Tr}(\hat{\Sigma})}{p} {\rm Id}$$
Choosing the amount of shrinkage, $\alpha$, amounts to setting a bias/variance trade-off, and is discussed below.
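As a minimal sketch (random data and an arbitrary shrinkage coefficient of 0.1, both chosen purely for illustration), the two routes should yield the same shrunk estimate:

```python
import numpy as np
from sklearn.covariance import (ShrunkCovariance, empirical_covariance,
                                shrunk_covariance)

# Illustrative sample with few observations relative to features.
rng = np.random.RandomState(42)
X = rng.randn(30, 10)

# Apply a fixed shrinkage coefficient to a pre-computed covariance matrix.
emp_cov = empirical_covariance(X)
shrunk = shrunk_covariance(emp_cov, shrinkage=0.1)

# Or fit a ShrunkCovariance estimator directly to the data.
est = ShrunkCovariance(shrinkage=0.1).fit(X)
print(np.allclose(est.covariance_, shrunk))  # expected: True
```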
Examples:
- See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit a `ShrunkCovariance` object to data.
2.6.2.2. Ledoit-Wolf shrinkage
In their 2004 paper [1], O. Ledoit and M. Wolf propose a formula to compute the optimal shrinkage coefficient $\alpha$ that minimizes the Mean Squared Error between the estimated and the real covariance matrix.
The Ledoit-Wolf estimator of the covariance matrix can be computed on a sample with the `ledoit_wolf` function of the `sklearn.covariance` package, or it can be otherwise obtained by fitting a `LedoitWolf` object to the same sample.
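A minimal sketch of both forms (random data, for illustration only):

```python
import numpy as np
from sklearn.covariance import LedoitWolf, ledoit_wolf

rng = np.random.RandomState(0)
X = rng.randn(40, 20)

# Function form: returns the shrunk covariance and the chosen coefficient.
lw_cov, shrinkage = ledoit_wolf(X)

# Estimator form: the fitted object exposes the coefficient it selected.
lw = LedoitWolf().fit(X)
print(lw.shrinkage_)  # data-driven shrinkage coefficient in [0, 1]
```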
Note
Case when population covariance matrix is isotropic
It is important to note that when the number of samples is much larger than the number of features, one would expect that no shrinkage would be necessary. The intuition behind this is that if the population covariance is full rank, when the number of samples grows, the sample covariance will also become positive definite. As a result, no shrinkage would be necessary, and the method should recover this automatically.

This, however, is not the case in the Ledoit-Wolf procedure when the population covariance happens to be a multiple of the identity matrix. In this case, the Ledoit-Wolf shrinkage estimate approaches 1 as the number of samples increases. This indicates that the optimal estimate of the covariance matrix in the Ledoit-Wolf sense is a multiple of the identity. Since the population covariance is already a multiple of the identity matrix, the Ledoit-Wolf solution is indeed a reasonable estimate.
Examples:
- See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit a `LedoitWolf` object to data and for visualizing the performance of the Ledoit-Wolf estimator in terms of likelihood.
References:
- [1] O. Ledoit and M. Wolf, "A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices", Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.
2.6.2.3. Oracle Approximating Shrinkage
Under the assumption that the data are Gaussian distributed, Chen et al. [2] derived a formula aimed at choosing a shrinkage coefficient that yields a smaller Mean Squared Error than the one given by Ledoit and Wolf's formula. The resulting estimator is known as the Oracle Approximating Shrinkage (OAS) estimator of the covariance.
The OAS estimator of the covariance matrix can be computed on a sample with the `oas` function of the `sklearn.covariance` package, or it can be otherwise obtained by fitting an `OAS` object to the same sample.
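A minimal sketch, mirroring the Ledoit-Wolf usage above (random data, for illustration only):

```python
import numpy as np
from sklearn.covariance import OAS, oas

rng = np.random.RandomState(0)
X = rng.randn(30, 15)

# Function form: returns the shrunk covariance and the shrinkage coefficient.
oas_cov, shrinkage = oas(X)

# Estimator form.
est = OAS().fit(X)
print(est.shrinkage_)  # generally differs from the Ledoit-Wolf coefficient
```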
Figure: bias-variance trade-off when setting the shrinkage, comparing the choices of Ledoit-Wolf and OAS estimators.
References:
- [2] Chen et al., "Shrinkage Algorithms for MMSE Covariance Estimation", IEEE Trans. on Signal Processing, Volume 58, Issue 10, October 2010.
Examples:
- See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit an `OAS` object to data.
- See Ledoit-Wolf vs OAS estimation to visualize the Mean Squared Error difference between a `LedoitWolf` and an `OAS` estimator of the covariance.
2.6.3. Sparse inverse covariance
The matrix inverse of the covariance matrix, often called the precision matrix, is proportional to the partial correlation matrix. It gives the partial independence relationship. In other words, if two features are independent conditionally on the others, the corresponding coefficient in the precision matrix will be zero. This is why it makes sense to estimate a sparse precision matrix: the estimation of the covariance matrix is better conditioned by learning independence relations from the data. This is known as covariance selection.
In the small-samples situation, in which `n_samples` is on the order of `n_features` or smaller, sparse inverse covariance estimators tend to work better than shrunk covariance estimators. However, in the opposite situation, or for very correlated data, they can be numerically unstable. In addition, unlike shrinkage estimators, sparse estimators are able to recover off-diagonal structure.
The `GraphicalLasso` estimator uses an l1 penalty to enforce sparsity on the precision matrix: the higher its `alpha` parameter, the more sparse the precision matrix. The corresponding `GraphicalLassoCV` object uses cross-validation to automatically set the `alpha` parameter.
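A minimal sketch (the synthetic data generation via `make_sparse_spd_matrix` and the fixed `alpha` value are illustrative choices, not prescriptions):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso, GraphicalLassoCV
from sklearn.datasets import make_sparse_spd_matrix

# Build a ground-truth sparse precision matrix and sample from its Gaussian.
prec = make_sparse_spd_matrix(10, alpha=0.9, random_state=0)
rng = np.random.RandomState(0)
X = rng.multivariate_normal(np.zeros(10), np.linalg.inv(prec), size=100)

# Fixed penalty: a larger alpha yields a sparser precision matrix.
gl = GraphicalLasso(alpha=0.05).fit(X)
print(np.sum(gl.precision_ == 0))  # number of exactly-zero coefficients

# Cross-validated penalty selection.
glcv = GraphicalLassoCV().fit(X)
print(glcv.alpha_)  # alpha chosen by cross-validation
```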
Figure: a comparison of maximum likelihood, shrinkage and sparse estimates of the covariance and precision matrix in the very small sample setting.
Note
Structure recovery
Recovering a graphical structure from correlations in the data is a challenging problem. If you are interested in such recovery, keep in mind that:
- Recovery is easier from a correlation matrix than a covariance matrix: standardize your observations before running `GraphicalLasso`.
- If the underlying graph has nodes with many more connections than the average node, the algorithm will miss some of these connections.
- If your number of observations is not large compared to the number of edges in your underlying graph, you will not recover it.
- Even if you are in favorable recovery conditions, the alpha parameter chosen by cross-validation (e.g. using the `GraphicalLassoCV` object) will lead to selecting too many edges. However, the relevant edges will have heavier weights than the irrelevant ones.
The mathematical formulation is the following:

$$\hat{K} = \mathrm{argmin}_K \left( \mathrm{tr}\, S K - \mathrm{log\,det}\, K + \alpha \|K\|_1 \right)$$

where $K$ is the precision matrix to be estimated, and $S$ is the sample covariance matrix. $\|K\|_1$ is the sum of the absolute values of off-diagonal coefficients of $K$. The algorithm employed to solve this problem is the GLasso algorithm, from the Friedman 2008 Biostatistics paper. It is the same algorithm as in the R `glasso` package.
Examples:
- Sparse inverse covariance estimation: example on synthetic data showing some recovery of a structure, and comparing to other covariance estimators.
- Visualizing the stock market structure: example on real stock market data, finding which symbols are most linked.
References:
- Friedman et al., "Sparse inverse covariance estimation with the graphical lasso", Biostatistics 9, pp. 432, 2008.
2.6.4. Robust Covariance Estimation
Real data sets are often subject to measurement or recording errors. Regular but uncommon observations may also appear for a variety of reasons. Observations which are very uncommon are called outliers. The empirical covariance estimator and the shrunk covariance estimators presented above are very sensitive to the presence of outliers in the data. Therefore, one should use robust covariance estimators to estimate the covariance of one's real data sets. Alternatively, robust covariance estimators can be used to perform outlier detection and discard/downweight some observations according to further processing of the data.
The `sklearn.covariance` package implements a robust estimator of covariance, the Minimum Covariance Determinant [3].
2.6.4.1. Minimum Covariance Determinant
The Minimum Covariance Determinant estimator is a robust estimator of a data set's covariance introduced by P. J. Rousseeuw in [3]. The idea is to find a given proportion (h) of "good" observations which are not outliers and compute their empirical covariance matrix. This empirical covariance matrix is then rescaled to compensate for the performed selection of observations ("consistency step"). Having computed the Minimum Covariance Determinant estimator, one can give weights to observations according to their Mahalanobis distance, leading to a reweighted estimate of the covariance matrix of the data set ("reweighting step").
Rousseeuw and Van Driessen [4] developed the FastMCD algorithm in order to compute the Minimum Covariance Determinant. This algorithm is used in scikit-learn when fitting an MCD object to data. The FastMCD algorithm also computes a robust estimate of the data set location at the same time.
Raw estimates can be accessed as the `raw_location_` and `raw_covariance_` attributes of a `MinCovDet` robust covariance estimator object.
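A minimal sketch (the contaminated sample below is invented for illustration):

```python
import numpy as np
from sklearn.covariance import MinCovDet

# Clean Gaussian sample, then shift a few points far away as outliers.
rng = np.random.RandomState(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 1.0]], size=200)
X[:10] += 8.0

mcd = MinCovDet(random_state=0).fit(X)
print(mcd.location_)       # robust location estimate
print(mcd.covariance_)     # reweighted robust covariance estimate
print(mcd.raw_location_)   # raw estimates, before the reweighting step
print(mcd.raw_covariance_)
```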
References:
- [3] P. J. Rousseeuw. Least median of squares regression. J. Am. Stat. Ass., 79:871, 1984.
- [4] A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Association and the American Society for Quality, TECHNOMETRICS.
Examples:
- See Robust vs Empirical covariance estimate for an example on how to fit a `MinCovDet` object to data and see how the estimate remains accurate despite the presence of outliers.
- See Robust covariance estimation and Mahalanobis distances relevance to visualize the difference between `EmpiricalCovariance` and `MinCovDet` covariance estimators in terms of Mahalanobis distance (so we get a better estimate of the precision matrix too).
Figures: influence of outliers on location and covariance estimates; separating inliers from outliers using a Mahalanobis distance.