6.4. Imputation of missing values

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. See the Glossary of Common Terms and API Elements entry on imputation.

6.4.1. Univariate vs. Multivariate Imputation

One type of imputation algorithm is univariate, which imputes values in the i-th feature dimension using only non-missing values in that feature dimension (e.g. impute.SimpleImputer). By contrast, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer).

6.4.2. Univariate feature imputation

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. This class also allows for different missing values encodings.

The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean value of the columns (axis 0) that contain the missing values:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
SimpleImputer()
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[4.          2.        ]
 [6.          3.666...]
 [7.          6.        ]]

The SimpleImputer class also supports sparse matrices:

>>> import scipy.sparse as sp
>>> X = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
>>> imp = SimpleImputer(missing_values=-1, strategy='mean')
>>> imp.fit(X)
SimpleImputer(missing_values=-1)
>>> X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
>>> print(imp.transform(X_test).toarray())
[[3. 2.]
 [6. 3.]
 [7. 6.]]

Note that this format is not meant to be used to implicitly store missing values in the matrix because it would densify it at transform time. Missing values encoded by 0 must be used with dense input.
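
For illustration, a minimal sketch of imputing 0-encoded missing values on a dense array (the toy data below is made up for this example):

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> X_dense = np.array([[1, 2], [0, 4], [8, 0]])  # 0 marks the missing entries
>>> imp_zero = SimpleImputer(missing_values=0, strategy='mean')
>>> imp_zero.fit_transform(X_dense)
array([[1. , 2. ],
       [4.5, 4. ],
       [8. , 3. ]])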

The SimpleImputer class also supports categorical data represented as string values or pandas categoricals when using the 'most_frequent' or 'constant' strategy:

>>> import pandas as pd
>>> df = pd.DataFrame([["a", "x"],
...                    [np.nan, "y"],
...                    ["a", np.nan],
...                    ["b", "y"]], dtype="category")
>>> imp = SimpleImputer(strategy="most_frequent")
>>> print(imp.fit_transform(df))
[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]

6.4.3. Multivariate feature imputation

A more sophisticated approach is to use the IterativeImputer class, which models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.

Note

This estimator is still experimental for now: the predictions and the API might change without any deprecation cycle. To use it, you need to explicitly import enable_iterative_imputer.

>>> import numpy as np
>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer
>>> imp = IterativeImputer(max_iter=10, random_state=0)
>>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
IterativeImputer(random_state=0)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> # the model learns that the second feature is double the first
>>> print(np.round(imp.transform(X_test)))
[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]

Both SimpleImputer and IterativeImputer can be used in a Pipeline as a way to build a composite estimator that supports imputation. See Imputing missing values before building an estimator.
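
As a minimal sketch of such a composite estimator (the toy data and the choice of LinearRegression as the final step are illustrative, not part of the linked example), the imputer simply precedes the predictor in the pipeline:

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.impute import SimpleImputer
>>> from sklearn.linear_model import LinearRegression
>>> X = [[1, 2], [np.nan, 3], [7, 6], [2, np.nan]]    # toy data with missing entries
>>> y = [10, 14, 30, 16]
>>> model = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
>>> model.fit(X, y).predict([[np.nan, 5]]).shape      # imputation happens inside the pipeline
(1,)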

6.4.3.1. Flexibility of IterativeImputer

There are many well-established imputation packages in the R data science ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns out to be a particular instance of different sequential imputation algorithms that can all be implemented with IterativeImputer by passing in different regressors to be used for predicting missing feature values. In the case of missForest, this regressor is a Random Forest. See Imputing missing values with variants of IterativeImputer.
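
For instance, a missForest-like imputer can be sketched by plugging a random forest into IterativeImputer via its estimator parameter (the hyperparameters and toy data below are illustrative only):

>>> import numpy as np
>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer
>>> from sklearn.ensemble import RandomForestRegressor
>>> rf = RandomForestRegressor(n_estimators=10, random_state=0)
>>> imp = IterativeImputer(estimator=rf, max_iter=10, random_state=0)
>>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
>>> imp.fit_transform(X).shape   # a fully imputed copy of X
(5, 2)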

6.4.3.2. Multiple vs. Single Imputation

In the statistics community, it is common practice to perform multiple imputations, generating, for example, m separate imputations for a single feature matrix. Each of these m imputations is then put through the subsequent analysis pipeline (e.g. feature engineering, clustering, regression, classification). The m final analysis results (e.g. held-out validation errors) allow the data scientist to obtain understanding of how analytic results may differ as a consequence of the inherent uncertainty caused by the missing values. The above practice is called multiple imputation.

Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True. See [2], chapter 4 for more discussion on multiple vs. single imputations.
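
A rough sketch of that procedure, producing m imputed datasets from the same matrix (the toy data and m = 3 are arbitrary choices):

>>> import numpy as np
>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer
>>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
>>> imputations = [
...     IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
...     for seed in range(3)]   # one imputation per random seed
>>> len(imputations), imputations[0].shape
(3, (5, 2))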

It is still an open problem as to how useful single vs. multiple imputation is in the context of prediction and classification when the user is not interested in measuring uncertainty due to missing values.

Note that a call to the transform method of IterativeImputer is not allowed to change the number of samples. Therefore multiple imputations cannot be achieved by a single call to transform.

6.4.4. References

  • [1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). “mice: Multivariate Imputation by Chained Equations in R”. Journal of Statistical Software 45: 1-67.

  • [2] Roderick J A Little and Donald B Rubin (1986). “Statistical Analysis with Missing Data”. John Wiley & Sons, Inc., New York, NY, USA.

6.4.5. Nearest neighbors imputation

The KNNImputer class provides imputation for filling in missing values using the k-Nearest Neighbors approach. By default, a euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors. Each missing feature is imputed using values from n_neighbors nearest neighbors that have a value for the feature. The features of the neighbors are averaged uniformly or weighted by distance to each neighbor. If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the particular feature being imputed. When the number of available neighbors is less than n_neighbors and there are no defined distances to the training set, the training set average for that feature is used during imputation. If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will be used during imputation. If a feature is always missing in training, it is removed during transform. For more information on the methodology, see ref. [OL2001].

The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean feature value of the two nearest neighbors of samples with missing values:

>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

  • [OL2001] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001, Pages 520-525.

6.4.6. Marking imputed values

The MissingIndicator transformer is useful to transform a dataset into a corresponding binary matrix indicating the presence of missing values in the dataset. This transformation is useful in conjunction with imputation. When using imputation, preserving the information about which values had been missing can be informative. Note that both the SimpleImputer and IterativeImputer have the boolean parameter add_indicator (False by default), which when set to True provides a convenient way of stacking the output of the MissingIndicator transformer with the output of the imputer.
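
A quick sketch of the add_indicator option (toy data made up for illustration): the imputed features come first, followed by one indicator column per feature that had missing values at fit time:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> X = [[1, 2], [np.nan, 3], [7, 6]]
>>> imp = SimpleImputer(strategy='mean', add_indicator=True)
>>> imp.fit_transform(X)   # last column marks where feature 0 was missing
array([[1., 2., 0.],
       [4., 3., 1.],
       [7., 6., 0.]])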

NaN is usually used as the placeholder for missing values. However, it enforces the data type to be float. The parameter missing_values allows specifying other placeholders, such as an integer. In the following example, we will use -1 as missing values:

>>> from sklearn.impute import MissingIndicator
>>> X = np.array([[-1, -1, 1, 3],
...               [4, -1, 0, -1],
...               [8, -1, 1, 0]])
>>> indicator = MissingIndicator(missing_values=-1)
>>> mask_missing_values_only = indicator.fit_transform(X)
>>> mask_missing_values_only
array([[ True,  True, False],
       [False,  True,  True],
       [False,  True, False]])

The features parameter is used to choose the features for which the mask is constructed. By default, it is 'missing-only' which returns the imputer mask of the features containing missing values at fit time:

>>> indicator.features_
array([0, 1, 3])

The features parameter can be set to 'all' to return all features whether or not they contain missing values:

>>> indicator = MissingIndicator(missing_values=-1, features="all")
>>> mask_all = indicator.fit_transform(X)
>>> mask_all
array([[ True,  True, False, False],
       [False,  True, False,  True],
       [False,  True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])

When using the MissingIndicator in a Pipeline, be sure to use the FeatureUnion or ColumnTransformer to add the indicator features to the regular features. First we obtain the iris dataset, and add some missing values to it.

>>> from sklearn.datasets import load_iris
>>> from sklearn.impute import SimpleImputer, MissingIndicator
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import FeatureUnion, make_pipeline
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = load_iris(return_X_y=True)
>>> mask = np.random.randint(0, 2, size=X.shape).astype(bool)
>>> X[mask] = np.nan
>>> X_train, X_test, y_train, _ = train_test_split(X, y, test_size=100,
...                                                random_state=0)

Now we create a FeatureUnion. All features will be imputed using SimpleImputer, in order to enable classifiers to work with this data. Additionally, it adds the indicator variables from MissingIndicator.

>>> transformer = FeatureUnion(
...     transformer_list=[
...         ('features', SimpleImputer(strategy='mean')),
...         ('indicators', MissingIndicator())])
>>> transformer = transformer.fit(X_train, y_train)
>>> results = transformer.transform(X_test)
>>> results.shape
(100, 8)

Of course, we cannot use the transformer to make any predictions. We should wrap this in a Pipeline with a classifier (e.g., a DecisionTreeClassifier) to be able to make predictions.

>>> clf = make_pipeline(transformer, DecisionTreeClassifier())
>>> clf = clf.fit(X_train, y_train)
>>> results = clf.predict(X_test)
>>> results.shape
(100,)