Imputation for supervised learning problems in high dimension

Abstract

Missing data are a common problem in data analysis. Missing values of the MAR (Missing At Random) type are considered here, meaning that the probability that a value is missing depends only on one or more observed variables. Most modern algorithms focus on this type of missingness, and the most widely used implementations are arguably MICE, missForest, missMDA, and k-Nearest Neighbors imputation. To account for sampling variability, it is better to propose $M$ values for each missing value rather than a single one. This procedure, known as “multiple imputation”, provides proper imputation, in contrast to improper imputation. In practice, $M = 5$ is often sufficient. Most existing methods are poorly suited to the high-dimensional setting, where the sample size $n$ is much smaller than the number of variables $p$, often written $n \ll p$. In supervised analysis, the dependent variable $y$ must be explained by the explanatory variables $x$. In high dimension, the part of $x$ associated with $y$ can be hard to identify, and classical imputation methodologies suffer as a result. In this communication, a new methodology, called Koh-Lanta, is presented. It handles missing values in a supervised context, uses multiple imputation, and addresses the high-dimensional issues. For the sake of simplicity, missing values are considered only in the $x$ part.
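
To make the notions of MAR missingness and multiple imputation with $M = 5$ concrete, the following is a minimal sketch on toy data. It is not the Koh-Lanta methodology and not any of the packages named above: it uses a simple bootstrap stochastic regression imputation as a stand-in, in low dimension, and pools the results with Rubin's rules. The variable names, the toy data-generating model, and the missingness probabilities (0.6 and 0.1) are all assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

M = 5    # number of imputations, the value quoted in the abstract
n = 200  # toy sample size (assumption)

# Toy data (assumption): x1 is always observed, x2 depends linearly on x1.
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2_full = [2.0 * a + random.gauss(0.0, 1.0) for a in x1]

# MAR mechanism: the probability that x2 is missing depends only on the observed x1.
x2 = [None if random.random() < (0.6 if a > 0 else 0.1) else b
      for a, b in zip(x1, x2_full)]
observed = [i for i in range(n) if x2[i] is not None]
missing = [i for i in range(n) if x2[i] is None]


def fit_line(xs, ys):
    """Least-squares fit ys ~ intercept + slope * xs; also returns the residual sd."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(rss / (len(xs) - 2))


completed = []
for _ in range(M):
    # Bootstrap the observed cases so each imputation reflects parameter uncertainty.
    boot = random.choices(observed, k=len(observed))
    intercept, slope, sd = fit_line([x1[i] for i in boot], [x2[i] for i in boot])
    filled = list(x2)
    for i in missing:
        # Draw from the predictive distribution instead of plugging in the fitted mean.
        filled[i] = intercept + slope * x1[i] + random.gauss(0.0, sd)
    completed.append(filled)

# Pool the M estimates of the mean of x2 with Rubin's rules.
estimates = [statistics.fmean(d) for d in completed]
within = statistics.fmean(statistics.variance(d) / n for d in completed)
between = statistics.variance(estimates)
pooled_mean = statistics.fmean(estimates)
total_variance = within + (1 + 1 / M) * between

print(f"pooled mean of x2: {pooled_mean:.3f}, total variance: {total_variance:.5f}")
```

Each of the $M$ completed datasets keeps the observed values untouched and differs only in the imputed entries. The between-imputation variance is what a single imputation cannot capture, which is why the pooled variance adds it to the within-imputation variance.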

Date
Location
Bologna, Italy